Comment 1 for bug 388028

Michael Terry (mterry) wrote :

I think the best way to do this is to have PyICU include the Transform classes in its bindings, apply the transform "lower; latin; nfkd" with them, and then manually strip anything that isn't a legal username character (anything matching [^-_a-zA-Z]). Since NFKD splits accented letters into a base letter plus combining marks, the stripping step removes accents and similar combining characters.
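As a rough illustration of that pipeline, here is a minimal sketch that uses Python's standard unicodedata module as a stand-in for the ICU transform. It only approximates the "lower" and "nfkd" steps; it does not perform the "latin" transliteration, so non-Latin scripts are simply dropped. The function name rough_username is made up for this example.

```python
import re
import unicodedata

# Anything matching this pattern is not a legal username character
# (the pattern is the one quoted in the comment above).
ILLEGAL = re.compile(r"[^-_a-zA-Z]")


def rough_username(name):
    """Approximate "lower; nfkd" plus stripping; NOT the full ICU transform."""
    lowered = name.lower()
    # NFKD splits e.g. "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", lowered)
    # The combining marks (and spaces, digits, etc.) are removed here.
    return ILLEGAL.sub("", decomposed)


print(rough_username("José Müller"))  # josemuller
```

Note that the space is stripped too, since it is not in the legal character set.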

This will still need special handling for some characters, including the example given of ø. My testing and an IBM FAQ entry [1] indicate that there are several special cases that the normal Unicode transforms don't handle correctly. So we'll have to hand-transform some things, like ß and æ; basically anything listed in the IBM article.
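To make the special-case problem concrete, here is a small sketch using Python's standard unicodedata module rather than ICU. It shows that NFKD leaves ø, ß and æ untouched, and applies a hand-written replacement table before stripping. The table entries and the function name are assumptions for illustration only; the real list would come from the IBM article [1].

```python
import re
import unicodedata

# Hypothetical hand-written table (illustrative, not taken from [1]).
SPECIAL_CASES = {"ø": "o", "ß": "ss", "æ": "ae"}

# These characters have no Unicode decomposition, so NFKD returns them unchanged.
for ch in SPECIAL_CASES:
    assert unicodedata.normalize("NFKD", ch) == ch


def username_with_special_cases(name):
    """Lowercase, hand-transform the special cases, decompose, then strip."""
    lowered = name.lower()
    replaced = "".join(SPECIAL_CASES.get(ch, ch) for ch in lowered)
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop everything that is not a legal username character.
    return re.sub(r"[^-_a-zA-Z]", "", decomposed)


print(username_with_special_cases("Søren Weiß"))  # sorenweiss
```

Without the table, the same input would lose the ø and ß entirely and come out as "srenwei".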

BTW, you can play around with Unicode transforms online [2]. It's pretty interesting. For our purposes, using the 'Names' data is particularly relevant.

Unfortunately, the PyICU bindings do *not* have the Transform bits of ICU wrapped yet.

[1] http://ibm.com/support/docview.wss?uid=swg21247569
[2] http://demo.icu-project.org/icu-bin/translit