Comment 5 for bug 782930

Matt Giuca (mgiuca) wrote:

So the above analysis was based on RFC 2396, which it turns out is actually the old URI spec, obsoleted by RFC 3986. RFC 3986 changes the syntax quite a bit. For one thing, I was wondering above why "#" was "excluded": it turns out that in the old document "#" isn't actually part of the URI, it's a separate delimiter. In RFC 3986, "#" is a reserved character, same as "?", and the rest of the syntax is defined quite differently. Some characters that used to be unreserved ("!", "*", "'", "(" and ")") are now reserved. So I looked at the above "safe set" again, and it turns out the safe set we chose is still exactly right, but for different reasons.

The relevant syntax is:

   URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
   hier-part = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-rootless = segment-nz *( "/" segment )
   path-empty = 0<pchar>
   segment = *pchar
   segment-nz = 1*pchar
   pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
   query = *( pchar / "/" / "?" )
   fragment = *( pchar / "/" / "?" )
   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

So basically we can fold a lot of cases into one: we consider everything after the "javascript:" to be a path (with zero or more slashes), followed by an optional "?" query and an optional "#" fragment. The path part is allowed to contain any of the following characters unescaped (assuming we don't place any special emphasis on "/"):

ALPHA / DIGIT / "-" / "." / "_" / "~" / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "@" / "/"
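
As a rough illustration, the path character set can be transcribed from the ABNF into Python sets. This is only a sketch: the names UNRESERVED, SUB_DELIMS, PCHAR, PATH_CHARS and unescaped_path_ok are made up for this example, and pct-encoded sequences are deliberately left out because the question here is which characters may appear literally.

```python
import string

# Character classes transcribed from the RFC 3986 ABNF quoted above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")  # pct-encoded is not a literal character

# Folding all the path-* rules together: pchar plus the "/" separator.
PATH_CHARS = PCHAR | {"/"}


def unescaped_path_ok(text):
    """Return True if every character of text may appear literally in a path."""
    return all(ch in PATH_CHARS for ch in text)
```

With these definitions, unescaped_path_ok("alert(1);") is true, while any string containing "?", "#" or a space is rejected.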

This notably excludes "?" and "#". But the first "?" starts the query, and the query may contain any character in the above set as well as further "?" characters. So, assuming we aren't going to parse the query string specially (and it looks like browsers don't, for javascript: URIs), we are effectively allowed as many "?" characters as we like.

Does the same apply for "#"? No. For some reason (perhaps oversight), "fragment" does not include "#" -- you aren't allowed to have a "#" in a URL after the first "#". So we shouldn't allow "#" to appear unescaped in the URI.
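
The asymmetry between "?" and "#" can be checked with a regular expression built from the grammar. This is a sketch, not a full RFC 3986 validator: it folds the path rules together as described above, and the names (_PCHAR, valid_after_scheme and so on) are hypothetical.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
_PCHAR = r"(?:[A-Za-z0-9._~!$&'()*+,;=:@-]|%[0-9A-Fa-f]{2})"
# Path with zero or more slashes (all the path-* rules folded together).
_PATH = rf"(?:{_PCHAR}|/)*"
# query and fragment share one rule: *( pchar / "/" / "?" )
_QUERY_OR_FRAGMENT = rf"(?:{_PCHAR}|[/?])*"

_AFTER_SCHEME = re.compile(
    rf"{_PATH}(?:\?{_QUERY_OR_FRAGMENT})?(?:#{_QUERY_OR_FRAGMENT})?"
)


def valid_after_scheme(text):
    """Return True if text is a valid path, optional query and optional fragment."""
    return _AFTER_SCHEME.fullmatch(text) is not None
```

Under this sketch "a?b?c" and "a#b" are accepted, but "a#b#c" is not, because nothing in the fragment rule can match the second "#".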

Also, as above, we remove "&" and "'" for other reasons -- they can't appear unescaped in an XML attribute.

Therefore, ignoring alphanumeric characters and "_", "." and "-" (always safe in Python-land), we want the following safe set:

;/?:@=+$,!~*()

That's the exact same set we already have.
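
The derivation can be double-checked mechanically. The following is a sketch that assumes Python 3's urllib.parse.quote; the helper name quote_js and the intermediate names are made up for illustration.

```python
from urllib.parse import quote

# Punctuation allowed unescaped in the path, leaving out "-", "." and "_",
# which quote() never escapes anyway.
PATH_PUNCT = "~!$&'()*+,;=:@/"

# "?" is effectively allowed any number of times; "#" is not added.
ALLOWED = set(PATH_PUNCT) | {"?"}

# "&" and "'" are removed because of the XML attribute concern.
XML_UNSAFE = set("&'")

SAFE = "".join(sorted(ALLOWED - XML_UNSAFE))

# Same characters as the hand-written safe set, in a different order.
assert set(SAFE) == set(";/?:@=+$,!~*()")


def quote_js(code):
    """Percent-encode everything outside the safe set (hypothetical helper)."""
    return quote(code, safe=SAFE)
```

For example, quote_js("alert('a&b#c')") gives "alert(%27a%26b%23c%27)", while "x=1?2:3;" passes through unchanged.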