Comment 24 for bug 1172106

John Dennis (jdennis-a) wrote :

To properly fix this we cannot just globally change the default
encoding; that is a temporary workaround, not a structural fix
consistent with OpenStack coding practice and Python3 semantics.

This is a sequence of 4 patches. The full commit was broken down into
4 patches to facilitate review, with each patch implementing one phase
of the total fix. Please see each commit message for details and the
rationale for the change.

The correct way to handle non-ascii characters is to always use
unicode strings. In Python2 this requires the use of the unicode
string object instead of the str string object. In Python3 all strings
are unicode (str objects are actually unicode and what was str in
Python2 becomes a bytes object in Python3). Thus all strings in
OpenStack code should be unicode in Python2 and will by definition be
unicode in Python3.
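
As a small illustration of the two string types (a sketch that assumes
it is run under Python3; the variable names are only for illustration):

```python
# Python3: str holds unicode code points, bytes holds raw octets.
text = u"Andr\u00e9"            # the u prefix is accepted by Python2 and Python3.3+
data = text.encode("utf-8")     # explicit conversion to a bytes object

assert isinstance(text, str)    # under Python2 this object would be of type unicode
assert isinstance(data, bytes)  # under Python2 this object would be of type str
assert len(text) == 5           # five characters
assert len(data) == 6           # six octets, the e-acute needs two in UTF-8
assert data.decode("utf-8") == text
```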

External library interfaces are often specified to require UTF-8
encoding for strings. This is because UTF-8 encoding is a byte (octet)
stream and a proper superset of ASCII. This is especially true of
libraries written in C or that implement RFCs whose specifications
require strings to be UTF-8 encoded; LDAP, XML, HTTP, etc. are common
examples.
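
The two properties that make UTF-8 convenient for byte-oriented
libraries can be checked directly. This is only a sketch; the sample
strings are made up, and the remark about NUL-terminated strings is my
reading of why C libraries favor UTF-8:

```python
ascii_text = u"cn=admin"
# Pure ASCII text encodes to identical octets in ASCII and in UTF-8.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

non_ascii = u"Andr\u00e9"
utf8 = non_ascii.encode("utf-8")
utf16 = non_ascii.encode("utf-16-le")

# UTF-8 only emits a zero octet for U+0000 itself, so code that treats
# strings as NUL-terminated byte arrays keeps working.
assert b"\x00" not in utf8
# A wider encoding such as UTF-16 embeds zero octets in ordinary text.
assert b"\x00" in utf16
```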

The natural consequence of this is Python maintains its strings as
unicode (either UCS-2 or UCS-4) and conversion to/from UTF-8 occurs at
I/O and/or API boundaries, in other words when string data is entering
or leaving the "python domain".
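
A minimal sketch of that boundary discipline, using an in-memory byte
stream to stand in for real I/O. The function names read_name and
write_name are hypothetical, not part of any patch:

```python
import io


def read_name(stream):
    """Entering the Python domain: decode octets to unicode once."""
    return stream.read().decode("utf-8")


def write_name(stream, name):
    """Leaving the Python domain: encode unicode to octets once."""
    stream.write(name.encode("utf-8"))


incoming = io.BytesIO(b"andr\xc3\xa9")
name = read_name(incoming)

# Inside the Python domain all work is done on unicode strings.
shouted = name.upper()

outgoing = io.BytesIO()
write_name(outgoing, shouted)
```

Doing the same work on the undecoded octets would silently skip the
non-ascii character, because bytes.upper() only touches ASCII letters.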

python-ldap is the standard LDAP API for interacting with LDAP from
Python. python-ldap requires UTF-8 encoded strings. It
would have been ideal if inside the python-ldap API binding it
converted unicode strings to UTF-8 but it doesn't and this unfortunate
omission requires us to do the conversion when calling LDAP and on the
data returned from LDAP. The fact the python-ldap API does not perform
UTF-8 conversion just means doing the conversion ourselves is
consistent with any other API or I/O boundary requiring UTF-8.
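
The conversion we have to do ourselves can be sketched with a pair of
helpers. The names utf8_encode and utf8_decode are illustrative and the
sketch assumes Python3 types; exactly which arguments a given
python-ldap call expects as encoded strings is not shown here:

```python
def utf8_encode(value):
    """Return value as UTF-8 octets; text is encoded, bytes pass through."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


def utf8_decode(value):
    """Return value as unicode text; bytes are decoded, text passes through."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Data on its way to LDAP is encoded ...
outbound = [utf8_encode(v) for v in [u"Andr\u00e9", u"Zo\u00eb"]]
# ... and data coming back from LDAP is decoded.
inbound = [utf8_decode(v) for v in outbound]
```

Raising TypeError for anything that is not a string is a deliberate
choice in this sketch: other Python types need an explicit string
representation first rather than a silent guess.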

To expedite LDAP testing without requiring a running live LDAP server
a fake LDAP API was introduced which emulates LDAP. Unfortunately the
fake LDAP is a poor emulation. For example it does not demand all LDAP
data be converted to strings nor that strings are UTF-8 encoded. This
meant a considerable portion of the LDAP unit tests were not catching
potential problems with data types being passed through the LDAP
API. Many of these problems only showed up during the occasional
testing against a live LDAP server using the python-ldap interface.

To address these issues the following was done:

* An abstract LDAP interface was defined. Both fake ldap and live ldap
  implement this interface. The interface requires UTF-8 encoded
  strings.

* An instance of the same abstract LDAP interface was implemented
  whose job it is to perform type conversion and logging, then
  call one of the LDAP instances to perform the actual LDAP
  operation. Note that type conversion covers more than UTF-8
  conversion; it also includes converting Python types such as
  booleans, integers, etc. to a string representation.

* The test coverage for non-ascii values was greatly expanded.
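
The layering described in the list above can be sketched roughly as
follows. This is not the code from the patches: every class, method and
function name here (LDAPHandler, StrictFakeLDAP, ConvertingLDAP,
to_ldap, add, get) is hypothetical, the interface is cut down to two
operations, and rendering booleans as TRUE/FALSE is an assumption based
on the LDAP Boolean syntax. The sketch assumes Python3.

```python
import abc
import logging

LOG = logging.getLogger(__name__)


class LDAPHandler(abc.ABC):
    """Hypothetical abstract interface; all strings crossing it are UTF-8 bytes."""

    @abc.abstractmethod
    def add(self, dn, attrs):
        """Store attrs (a dict of name to list of values) under dn."""

    @abc.abstractmethod
    def get(self, dn):
        """Return the attrs stored under dn."""


class StrictFakeLDAP(LDAPHandler):
    """In-memory fake that rejects anything that is not UTF-8 encoded bytes."""

    def __init__(self):
        self.entries = {}

    @staticmethod
    def _check(value):
        if not isinstance(value, bytes):
            raise TypeError("LDAP data must be UTF-8 bytes, got %r" % (value,))
        value.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8

    def add(self, dn, attrs):
        self._check(dn)
        for name, values in attrs.items():
            self._check(name)
            for value in values:
                self._check(value)
        self.entries[dn] = attrs

    def get(self, dn):
        self._check(dn)
        return self.entries[dn]


def to_ldap(value):
    """Convert a Python value to a UTF-8 encoded string representation."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, bool):  # test bool first, bool is a subclass of int
        return b"TRUE" if value else b"FALSE"
    if isinstance(value, (int, str)):
        return str(value).encode("utf-8")
    raise TypeError("cannot convert %s for LDAP" % type(value).__name__)


class ConvertingLDAP(LDAPHandler):
    """Performs type conversion and logging, then delegates to a real handler."""

    def __init__(self, conn):
        self.conn = conn

    def add(self, dn, attrs):
        LOG.debug("LDAP add: dn=%s attrs=%s", dn, attrs)
        encoded = {to_ldap(name): [to_ldap(v) for v in values]
                   for name, values in attrs.items()}
        self.conn.add(to_ldap(dn), encoded)

    def get(self, dn):
        LOG.debug("LDAP get: dn=%s", dn)
        raw = self.conn.get(to_ldap(dn))
        return {name.decode("utf-8"): [v.decode("utf-8") for v in values]
                for name, values in raw.items()}


# Callers work purely with unicode strings and native Python types.
ldap_api = ConvertingLDAP(StrictFakeLDAP())
user_dn = u"cn=Andr\u00e9,dc=example,dc=com"
ldap_api.add(user_dn, {u"cn": [u"Andr\u00e9"],
                       u"enabled": [True],
                       u"uidNumber": [1000]})
entry = ldap_api.get(user_dn)
```

Because the fake enforces the same type rules as the wrapper's other
backend would, a unit test that bypasses the conversion layer fails in
the fake instead of only failing against a live server.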