accept Unicode pseudo-URLs and treat as UTF-8

Bug #42514 reported by Martin Pool
Affects: Bazaar
Status: Fix Released
Importance: Medium
Assigned to: John A Meinel

Bug Description

URLs are technically ASCII-only, but it's reasonably common for them to carry encoded Unicode. We can translate to and from Unicode pseudo-URLs for user input and output, subject to some limitations. Because this translation rests on assumptions (notably that the encoded bytes are UTF-8), we should use only proper URLs inside the program and in storage.

See thread

https://lists.ubuntu.com/archives/bazaar-ng/2006q2/011104.html

and replies.
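
For the output direction (proper URL to Unicode pseudo-URL for display), a minimal sketch follows. It is an illustration only, not code from bzr: the function name `url_for_display` is hypothetical, and it assumes that only escapes of non-ASCII bytes should be decoded, so that escapes such as %20 or %2F keep their meaning. Runs of escapes that are not valid UTF-8 are left untouched, which is one of the limitations mentioned above.

```python
import re
from urllib.parse import unquote_to_bytes

# Runs of percent-escapes whose byte value is >= 0x80 (i.e. non-ASCII bytes).
_HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def url_for_display(url):
    """Return a Unicode pseudo-URL for showing to the user (hypothetical helper)."""

    def decode_run(match):
        try:
            # Turn the escaped bytes back into text, assuming they are UTF-8.
            return unquote_to_bytes(match.group(0)).decode("utf-8")
        except UnicodeDecodeError:
            # Not UTF-8 after all: keep the escapes exactly as they were.
            return match.group(0)

    return _HIGH_ESCAPES.sub(decode_run, url)
```

ASCII escapes are deliberately not decoded, so the displayed form can still be converted back to the same proper URL.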

Martin Pool (mbp) wrote :

In more detail:

In places where a URL is expected:

- determine that it's a URL, not a local filename, by looking for "scheme://" for some known scheme

- replace non-ASCII characters with their UTF-8 byte representation

- replace reserved characters with their URL-escaped (percent-encoded) representation

Refinements that apply this process to each path component separately are possible, but I'm not sure they're necessary.

Special handling may be required for Unicode domain names.
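
The input-side steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `KNOWN_SCHEMES`, `SAFE`, `is_url` and `pseudo_url_to_url` are hypothetical names, the scheme list is a placeholder, and URL delimiters plus "%" are assumed to pass through unescaped so the URL's structure and any existing escapes survive. Unicode domain names are not handled here.

```python
import re
from urllib.parse import quote

# Placeholder scheme list (an assumption); a real one would come from the
# transports the program actually supports.
KNOWN_SCHEMES = ("http", "https", "ftp", "sftp", "file")

# Assumed pass-through set: URL delimiters, plus "%" to keep existing escapes.
SAFE = "/:?#[]@!$&'()*+,;=%"


def is_url(text):
    """True if text starts with 'scheme://' for one of the known schemes."""
    m = re.match(r"([A-Za-z][A-Za-z0-9+.-]*)://", text)
    return bool(m) and m.group(1).lower() in KNOWN_SCHEMES


def pseudo_url_to_url(text):
    """Convert a Unicode pseudo-URL to an ASCII URL; leave local filenames alone."""
    if not is_url(text):
        return text
    # quote() encodes non-ASCII characters as UTF-8 and percent-escapes
    # every byte that is not a letter, digit, "_.-~" or listed in SAFE.
    return quote(text, safe=SAFE, encoding="utf-8")
```

Something like `pseudo_url_to_url(user_input)` would then be called once at the input boundary, so the rest of the program only ever sees proper URLs.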

Martin Pool (mbp) wrote : pseudocode

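  from urllib.parse import quote

  def escape_pseudo_url(pseudo_url, url_safe_characters):
    r = ''
    for c in pseudo_url:
      if c in url_safe_characters:
        r += c  # already legal in a URL, keep as-is
      else:
        # percent-escape every byte of the character's UTF-8 encoding
        r += quote(c.encode('utf-8'), safe='')
    return r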

John A Meinel (jameinel) wrote :

This should be committed in my encoding branch.

Changed in bzr:
assignee: nobody → jameinel
status: Unconfirmed → Fix Committed
John A Meinel (jameinel)
Changed in bzr:
status: Fix Committed → Fix Released