Bug #782930 “Bookmarklet is not URI safe for including on a page...” : Bugs : qrify

Revision history for this message

David Coles (dcoles) wrote on 2011-05-15:

#1

Adding Matt since I'm kind of curious to hear his thoughts, since he wrote the bookmarklet.py script.

Revision history for this message

Matt Giuca (mgiuca) wrote on 2011-05-15:

#2

Don't worry. I read all the bugs anyway.

Yes, I suppose this is true. Note that I originally wrote bookmarklet.py using urllib.quote (of course ;) I wrote it!) But then I noticed that I was escaping a lot of characters you weren't, such as quotes and things. Your version only escaped spaces, so I changed it to simply replace space with %20. I think I looked at some other bookmarklets and they also only escaped space.

But I would be more than happy to use urllib.quote, and I doubt it would cause any problems.

Revision history for this message

Matt Giuca (mgiuca) wrote on 2011-06-02:

#3

So if you use urllib.parse.quote, you end up with a huge mess of basically everything percent-encoded (including parens). I think we should encode the minimum set of characters necessary, so here is a bit of an analysis.

There are two standards at work here: URI syntax (RFC 2396) and XML attribute value syntax (W3C XML 1.0 Fifth Edition). Note that commonly, people encode URIs and then slap them into a href="" attribute, assuming that a valid URI is "safe", but it actually isn't, because URIs may include ampersands ("&") and single quotes ("'"), which are illegal in XML attribute values (note that single quotes are only illegal in an attribute value delimited by single quotes, but that is still a possibility). Therefore, we should not allow URIs with bare ampersands or single quotes.

As for URI syntax, RFC 2396 divides the ASCII characters into three groups: reserved, unreserved, and excluded. These sets are, in full:

reserved: ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved: alpha | digit | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
excluded: all control characters (<U+0020 and U+007F) | " " | <"> | "#" | "%" | "\" | "<" | ">" | "[" | "]" | "^" | "`" | "|"

It's weird that "#" and "%" are "excluded" rather than "reserved", since they do appear in URIs. All of the other excluded characters are totally illegal.

Anyway, my reading of "reserved" is "they can appear in URIs but they *may* have special meaning, depending on the scheme and the part of the URI in question, and "unreserved" is "they may freely appear in URIs and have precisely the same meaning as their escaped versions". Since the syntax of our scheme ("javascript") is the JavaScript language, I would assume we don't need to worry about *any* of the reserved characters being present. In other words, if the string "http://" appears unquoted in a javascript URI, it shouldn't matter. Therefore, we should allow all reserved and unreserved characters to appear unescaped, and only escape the excluded characters, BUT with the above caveat that we also escape "&" and "'".

Since Python's urllib.parse.quote function by default escapes everything but alphanumeric characters and "_", "." and "-", we should supply the following safe set:
";/?:@=+$,!~*()"

So if you use urllib.parse.quote, you end up with a huge mess of basically everything percent-encoded (including parens). I think we should encode the minimum set of characters necessary, so here is a bit of an analysis.

There are two standards at work here: URI syntax (RFC 2396) and XML attribute value syntax (W3C XML 1.0 Fifth Edition). Note that commonly, people encode URIs and then slap them into a href="" attribute, assuming that a valid URI is "safe", but it actually isn't, because URIs may include ampersands ("&") and single quotes ("'"), which are illegal in XML attribute values (note that single quotes are only illegal in an attribute value delimited by single quotes, but that is still a possibility). Therefore, we should not allow URIs with bare ampersands or single quotes.

As for URI syntax, RFC 2396 divides the ASCII characters into three groups: reserved, unreserved, and excluded. These sets are, in full:

reserved: ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved: alpha | digit | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
excluded: all control characters (<U+0020 and U+007F) | " " | <"> | "#" | "%" | "\" | "<" | ">" | "[" | "]" | "^" | "`" | "|"

It's weird that "#" and "%" are "excluded" rather than "reserved", since they do appear in URIs. All of the other excluded characters are totally illegal.

Anyway, my reading of "reserved" is "they can appear in URIs but they *may* have special meaning, depending on the scheme and the part of the URI in question, and "unreserved" is "they may freely appear in URIs and have precisely the same meaning as their escaped versions". Since the syntax of our scheme ("javascript") is the JavaScript language, I would assume we don't need to worry about *any* of the reserved characters being present. In other words, if the string "http://" appears unquoted in a javascript URI, it shouldn't matter. Therefore, we should allow all reserved and unreserved characters to appear unescaped, and only escape the excluded characters, BUT with the above caveat that we also escape "&" and "'".

Since Python's urllib.parse.quote function by default escapes everything but alphanumeric characters and "_", "." and "-", we should supply the following safe set:
";/?:@=+$,!~*()"

Revision history for this message

Matt Giuca (mgiuca) wrote on 2011-06-02:

#4

Fixed in trunk r18 by David Coles. I tweaked it in trunk r20.

Changed in qrify:
status:	Triaged → Fix Committed

Revision history for this message

Matt Giuca (mgiuca) wrote on 2011-07-01:

#5

So the above analysis was based on RFC 2936, which it turns out is actually the old URI spec, obsoleted by RFC 3986. 3986 changes the syntax quite a bit -- for one thing I was wondering above why "#" was "excluded". It turns out in the old document, "#" isn't actually part of the URI, it's separate. In 3986, "#" is a reserved character, same as "?". And the rest of the syntax is defined quite differently. There are also some unreserved characters that are now reserved. So I looked at the above "safe set" again, and it turns out the safe set we chose is still exactly right, but for different reasons.

The relevant syntax is:

   URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
   hier-part = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-rootless = segment-nz *( "/" segment )
   path-empty = 0<pchar>
   segment = *pchar
   segment-nz = 1*pchar
   pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
   query = *( pchar / "/" / "?" )
   fragment = *( pchar / "/" / "?" )
   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

So basically we can fold a lot of cases into one: we consider everything after the "javascript" to be a path (with zero or more slashes), followed by an optional "?" query and an optional "#" fragment. The path part is allowed to contain any of the following characters unescaped (assuming we don't place any special emphasis on "/"):

ALPHA / DIGIT / "-" / "." / "_" / "~" / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "@" "/"

This notably excludes "?" and "#". But, assuming we aren't going to parse the query string specially (and it looks like browsers don't, for javascript: URIs), we are allowed to have exactly one "?". After which, we can have more characters in the above set, as well as "?". Which means we are effectively allowed as many "?" characters as we like.

Does the same apply for "#"? No. For some reason (perhaps oversight), "fragment" does not include "#" -- you aren't allowed to have a "#" in a URL after the first "#". So we shouldn't allow "#" to appear unescaped in the URI.

Also, as above, we remove "&" and "'" for other reasons -- they can't appear unescaped in an XML attribute.

Therefore, ignoring alphanumeric characters and "_", "." and "-" (always safe in Python-land), we want the following safe set:

;/?:@=+$,!~*()

That's the exact same set we already have.

So the above analysis was based on RFC 2936, which it turns out is actually the old URI spec, obsoleted by RFC 3986. 3986 changes the syntax quite a bit -- for one thing I was wondering above why "#" was "excluded". It turns out in the old document, "#" isn't actually part of the URI, it's separate. In 3986, "#" is a reserved character, same as "?". And the rest of the syntax is defined quite differently. There are also some unreserved characters that are now reserved. So I looked at the above "safe set" again, and it turns out the safe set we chose is still exactly right, but for different reasons.

The relevant syntax is:

URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
   hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>
   segment       = *pchar
   segment-nz    = 1*pchar
   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
   query         = *( pchar / "/" / "?" )
   fragment      = *( pchar / "/" / "?" )
   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

So basically we can fold a lot of cases into one: we consider everything after the "javascript" to be a path (with zero or more slashes), followed by an optional "?" query and an optional "#" fragment. The path part is allowed to contain any of the following characters unescaped (assuming we don't place any special emphasis on "/"):

ALPHA / DIGIT / "-" / "." / "_" / "~" / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "@" "/"

This notably excludes "?" and "#". But, assuming we aren't going to parse the query string specially (and it looks like browsers don't, for javascript: URIs), we are allowed to have exactly one "?". After which, we can have more characters in the above set, as well as "?". Which means we are effectively allowed as many "?" characters as we like.

Does the same apply for "#"? No. For some reason (perhaps oversight), "fragment" does not include "#" -- you aren't allowed to have a "#" in a URL after the first "#". So we shouldn't allow "#" to appear unescaped in the URI.

Also, as above, we remove "&" and "'" for other reasons -- they can't appear unescaped in an XML attribute.

Therefore, ignoring alphanumeric characters and "_", "." and "-" (always safe in Python-land), we want the following safe set:

;/?:@=+$,!~*()

That's the exact same set we already have.

qrify

Bookmarklet is not URI safe for including on a page

Bug Description

Related branches

Other bug subscribers

Remote bug watches