URI() does not encode properly with percent-encoding

Bug #1102177 reported by Xan
This bug affects 1 person
Affects    Status    Importance    Assigned to    Milestone
lazr.uri   Invalid   Undecided     Unassigned

Bug Description

import lazr_uri # this is local copy

# https://en.wikipedia.org/wiki/Percent-encoding
a = lazr_uri.URI('http://en.wikipedia.org/wiki/Operators_in_C_and_C++')
c = lazr_uri.URI('http://ca.wikipedia.org/wiki/Llaç')
b = lazr_uri.URI('http://ca.wikipedia.org/wiki/Lla%C3%A7')
print(a, b, c)

The result is the error "URIs must consist of ASCII characters."

I would expect lazr.uri to percent-encode such characters itself (clearly c == b).

Could someone improve lazr.uri so that it properly percent-encodes these characters and the constructor accepts them?

Thanks in advance,
Xan.

Xan (dxpublica)
description: updated
William Grant (wgrant) wrote :

c isn't a URI, because it contains non-ASCII characters. lazr.uri handles URIs.

Changed in lazr.uri:
status: New → Invalid
Xan (dxpublica) wrote :

Wrong. Strictly speaking, c is not a URI, yes, but RFC 3986 specifies how to transform any character into "%" HEXDIG HEXDIG form (Sections 2.1 to 2.5). So c becomes b.

William Grant (wgrant) wrote : Re: [Bug 1102177] Re: URI() does not encode properly with percent-encoding

On 21/01/13 17:07, Xan wrote:
> Wrong. Strictly speaking, c is not a URI, yes, but RFC 3986 specifies
> how to transform any character into "%" HEXDIG HEXDIG form (Sections
> 2.1 to 2.5). So c becomes b.

From section 2.4:

"""
   Under normal circumstances, the only time when octets within a URI
   are percent-encoded is during the process of producing the URI from
   its component parts. This is when an implementation determines which
   of the reserved characters are to be used as subcomponent delimiters
   and which can be safely used as data. Once produced, a URI is always
   in its percent-encoded form.
"""
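
To illustrate the quoted passage, here is a minimal sketch of producing a URI from its component parts, assuming Python 3 and the standard urllib.parse module. The helper name build_uri is hypothetical and is not part of lazr.uri; the point is only that percent-encoding happens while the URI is being assembled, before anything is handed to a URI parser.

```python
from urllib.parse import quote, urlunsplit


def build_uri(scheme, host, path_segments):
    """Assemble a URI, percent-encoding each path segment as data.

    Hypothetical helper for illustration; not part of lazr.uri.
    """
    # safe="" means every character outside the unreserved set is encoded,
    # including "/" inside a segment, so it cannot be mistaken for a delimiter.
    # Non-ASCII characters are encoded as their UTF-8 octets.
    path = "/" + "/".join(quote(segment, safe="") for segment in path_segments)
    return urlunsplit((scheme, host, path, "", ""))


uri = build_uri("http", "ca.wikipedia.org", ["wiki", "Llaç"])
print(uri)  # http://ca.wikipedia.org/wiki/Lla%C3%A7
```

Note that with safe="" a "+" in a segment would also be encoded as %2B. Whether a reserved character is data or a delimiter is exactly the decision the producer has to make at this stage.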

Xan (dxpublica) wrote :

What do you mean?

Are lazr_uri.URI('http://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B') and lazr_uri.URI('http://en.wikipedia.org/wiki/Operators_in_C_and_C++') equivalent?

To put it another way, is there a way to transform c into a valid, percent-encoded ASCII URI?

Regards,
Xan.

William Grant (wgrant) wrote :

On 21/01/13 20:07, Xan wrote:
> What do you mean?
>
> Are lazr_uri.URI('http://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B')
> and lazr_uri.URI('http://en.wikipedia.org/wiki/Operators_in_C_and_C++')
> equivalent?

+ is a reserved character. From RFC 3986 section 2.2 "Reserved Characters":

"""
   Percent-encoding a reserved character, or decoding a percent-encoded
   octet that corresponds to a reserved character, will change how the
   URI is interpreted by most applications.
"""

From RFC 2616 section 3.2.3 "URI Comparison" (note that RFC 3986
supersedes the referenced RFC 2396):

"""
   Characters other than those in the "reserved" and "unsafe" sets (see
   RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.
"""

So, no, those two are not equivalent. "+" is a reserved character, so it
cannot be translated to or from "%2B" without changing the URL's meaning.
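
The distinction can be sketched in code. The following is a rough illustration, assuming Python 3, of the percent-encoding normalization described in RFC 3986 section 6.2.2: only percent-encoded unreserved characters are decoded, and the hex digits of every other escape are uppercased. The function names are hypothetical, and this is not how lazr.uri itself compares URIs; it also ignores other normalization steps such as lowercasing the scheme and host.

```python
import re
import string

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri):
    """Decode escapes of unreserved characters; uppercase all other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # e.g. %7E -> ~, which is an equivalent spelling
        return "%" + match.group(1).upper()  # reserved or non-ASCII octet: keep encoded
    return _PCT.sub(fix, uri)


def equivalent(a, b):
    """Compare two URIs after percent-encoding normalization only."""
    return normalize_percent_encoding(a) == normalize_percent_encoding(b)


print(equivalent("http://example.org/%7Euser", "http://example.org/~user"))  # True
print(equivalent(
    "http://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B",
    "http://en.wikipedia.org/wiki/Operators_in_C_and_C++",
))  # False
```

Because "+" is reserved, %2B is left encoded and the two Wikipedia URIs stay distinct, while "~" is unreserved and so %7E and "~" compare equal.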

Xan (dxpublica) wrote :

Oh, thanks, but what about the c -> b conversion? Those two URIs are equivalent, aren't they?

I think what I want is IRI-to-URI conversion.
I found an iri_to_uri function in https://github.com/django/django/blob/master/django/utils/encoding.py

Perhaps you could include it in your code.

Thanks for discussing this with me.
Xan.
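
For reference, the c -> b conversion asked about above can be sketched with the standard library alone. This is a minimal illustration assuming Python 3; it is not Django's implementation and not part of lazr.uri, and the function name is reused here only for familiarity. It assumes the host name is already ASCII (internationalized host names would need separate handling) and leaves reserved characters and existing percent-escapes untouched.

```python
from urllib.parse import quote

# Reserved characters (gen-delims and sub-delims, RFC 3986 section 2.2),
# plus "%" so that already percent-encoded octets are not encoded twice.
_SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri):
    """Percent-encode the non-ASCII characters of an IRI as UTF-8 octets.

    Sketch only: assumes an ASCII host name.
    """
    return quote(iri, safe=_SAFE)


c = "http://ca.wikipedia.org/wiki/Llaç"
b = "http://ca.wikipedia.org/wiki/Lla%C3%A7"
a = "http://en.wikipedia.org/wiki/Operators_in_C_and_C++"

print(iri_to_uri(c) == b)  # True
print(iri_to_uri(b) == b)  # True: existing escapes are preserved
print(iri_to_uri(a) == a)  # True: the reserved "+" is left alone
```

The result is a pure-ASCII string, which is the form of input that lazr.uri is described in this thread as accepting.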
