qrify

Bug #782930
Comment #3

Matt Giuca (mgiuca) wrote on 2011-06-02:

So if you use urllib.parse.quote, you end up with a huge mess of basically everything percent-encoded (including parens). I think we should encode the minimum set of characters necessary, so here is a bit of an analysis.

There are two standards at work here: URI syntax (RFC 2396) and XML attribute value syntax (W3C XML 1.0 Fifth Edition). Note that commonly, people encode URIs and then slap them into a href="" attribute, assuming that a valid URI is "safe", but it actually isn't, because URIs may include ampersands ("&") and single quotes ("'"), which are illegal in XML attribute values (note that single quotes are only illegal in an attribute value delimited by single quotes, but that is still a possibility). Therefore, we should not allow URIs with bare ampersands or single quotes.
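To illustrate the point about attribute values, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and the sample markup are made up for illustration; they are not from the code under discussion.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Hypothetical helper: True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A bare ampersand inside an attribute value is not well-formed XML.
bare_ampersand = '<a href="javascript:f(a&b)"/>'

# A single quote ends a single-quote-delimited attribute value early.
bare_single_quote = "<a href='javascript:f('x')'/>"

# With both characters percent-encoded, either delimiter style is fine.
encoded_double = '<a href="javascript:f(%27x%27,a%26b)"/>'
encoded_single = "<a href='javascript:f(%27x%27,a%26b)'/>"

print(is_well_formed(bare_ampersand))     # False
print(is_well_formed(bare_single_quote))  # False
print(is_well_formed(encoded_double))     # True
print(is_well_formed(encoded_single))     # True
```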

As for URI syntax, RFC 2396 divides the ASCII characters into three groups: reserved, unreserved, and excluded. These sets are, in full:

reserved: ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved: alpha | digit | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
excluded: all control characters (U+0000 to U+001F, and U+007F) | " " | <"> | "#" | "%" | "\" | "<" | ">" | "[" | "]" | "^" | "`" | "{" | "|" | "}"
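As a rough sketch, the three groups can be written down in Python like this. The names RESERVED, UNRESERVED and classify are just illustrative, and "excluded" is treated here as "any ASCII character not in the other two sets".

```python
import string

# The reserved and unreserved sets as listed above (RFC 2396).
RESERVED = set(";/?:@&=+$,")
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")


def classify(ch):
    """Hypothetical helper: name the RFC 2396 group of one ASCII character."""
    if len(ch) != 1 or ord(ch) > 0x7F:
        raise ValueError("expected a single ASCII character")
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    # Everything else in ASCII: controls, space, and the remaining punctuation.
    return "excluded"


for ch in "&'(#% ":
    print(repr(ch), classify(ch))
```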

It's weird that "#" and "%" are "excluded" rather than "reserved", since they do appear in URIs. All of the other excluded characters are totally illegal.

Anyway, my reading of "reserved" is "they can appear in URIs but they *may* have special meaning, depending on the scheme and the part of the URI in question", and "unreserved" is "they may freely appear in URIs and have precisely the same meaning as their escaped versions". Since the syntax of our scheme ("javascript") is the JavaScript language, I would assume we don't need to worry about *any* of the reserved characters being present. In other words, if the string "http://" appears unquoted in a javascript URI, it shouldn't matter. Therefore, we should allow all reserved and unreserved characters to appear unescaped, and only escape the excluded characters, BUT with the above caveat that we also escape "&" and "'".

Since Python's urllib.parse.quote function by default escapes everything but alphanumeric characters, "_", "." and "-" (and "/", via its default safe="/" argument), we should supply the following safe set:
";/?:@=+$,!~*()"
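Here is a minimal sketch of what that would look like with urllib.parse.quote. The helper name quote_javascript_uri and the sample script are assumptions for illustration, not the actual code in the project.

```python
from urllib.parse import quote

# Reserved and unreserved punctuation from RFC 2396, minus "&" and "'",
# which are left out so the result is safe in an XML attribute value.
SAFE = ";/?:@=+$,!~*()"


def quote_javascript_uri(script):
    """Hypothetical helper: build a javascript: URI, escaping as little as possible."""
    return "javascript:" + quote(script, safe=SAFE)


print(quote_javascript_uri("alert('a & b');"))
# javascript:alert(%27a%20%26%20b%27);

print(quote_javascript_uri("location.href='http://example.com/?q=1';"))
# javascript:location.href=%27http://example.com/?q=1%27;
```

Parens, semicolons and "http://" come through untouched, while spaces, "&" and "'" are percent-encoded.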
