CSV formatter not working with unicode

Bug #1720115 reported by Julie Pichon
Affects Status Importance Assigned to Milestone
cliff
Fix Released
Undecided
John Dennis

Bug Description

Testing on a recent devstack, it looks like the CSV formatter no longer supports unicode:

$ openstack project create éèë
$ openstack project list --format csv
"ID","Name"
"21936b57efce4c5e87d402bc9fbfed44","alt_demo"
"5a6f97ac785d4150bf33e78ae591a444","invisible_to_admin"
"5e329abfc39c4fc5ac13b1e0765d9222","admin"
"6d4639dfdad34977887f2a3fc45a62aa","service"
'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)

I understand this should be fixed as of bug 1481014 but it doesn't seem to be the case. I could get it to work on an older environment (Mitaka with cliff 2.0.0), but not from Newton.

From what I can tell it is due to the fix for bug 1603210 [1], which was backported. When I comment out L121 [2] which was brought in as part of that patch, the csv formatter starts working again. I think perhaps we're trying to encode a string that unicodecsv already encoded?

John Dennis kindly took a look and suggested unicodecsv should no longer be necessary now that we encode properly at the stream level, though unfortunately a quick manual test suggests that won't be enough on its own.

[1] https://review.openstack.org/#/c/342914/
[2] https://github.com/openstack/cliff/blob/f2c381/cliff/app.py#L120-L121

Revision history for this message
Julie Pichon (jpichon) wrote :

With unicodecsv:
$ openstack project list --format=csv
[...]
'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)

After removing it:
$ openstack project list --format=csv
[...]
'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Revision history for this message
Julie Pichon (jpichon) wrote :

Traceback for the original error, if that helps:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/display.py", line 115, in run
    self.produce_output(parsed_args, column_names, data)
  File "/usr/lib/python2.7/site-packages/cliff/lister.py", line 53, in produce_output
    parsed_args,
  File "/usr/lib/python2.7/site-packages/cliff/formatters/commaseparated.py", line 62, in emit_list
    for c in row]
  File "/usr/lib/python2.7/site-packages/unicodecsv/py2.py", line 86, in writerow
    _stringify_list(row, self.encoding, self.encoding_errors))
  File "/usr/lib64/python2.7/codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)

https://github.com/openstack/cliff/blob/f2c381/cliff/formatters/commaseparated.py#L49-L63

(When removing unicodecsv, the failure happens earlier, directly in emit_list:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/display.py", line 115, in run
    self.produce_output(parsed_args, column_names, data)
  File "/usr/lib/python2.7/site-packages/cliff/lister.py", line 53, in produce_output
    parsed_args,
  File "/usr/lib/python2.7/site-packages/cliff/formatters/commaseparated.py", line 62, in emit_list
    for c in row]
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128))

Revision history for this message
John Dennis (jdennis-a) wrote :

I've debugged what is occurring and will capture the information here. We'll have to figure out a solution later.

The standard csv module in Py2 cannot accept unicode strings, which in and of itself is very odd. The fundamental reason for this is that a portion of the module is written in C, and at the moment I can't explain the necessity of implementing part of it in C, but be that as it may ...

The reason why csv cannot accept unicode during writes can be seen in Modules/_csv.c in the function csv_writerow(). It does a PyString_Check(field) on line 1182, which fails for a unicode string (because PyString_Check only returns true for str objects), and then later tries to convert the object to a str by calling PyObject_Str(field) on line 1198. PyObject_Str(field) is equivalent to str(field) in Python. To convert a unicode object to str, the default encoding of ASCII is applied, which of course fails whenever the unicode string contains non-ascii characters. The _csv C module could be made unicode aware, and it wouldn't be hard, but it isn't, so not much we can do there.

Therefore, to get around the problem of the CPython _csv.so module being unable to accept unicode, a new module called unicodecsv was introduced. unicodecsv encodes all unicode objects to utf-8; the result of the utf-8 encoding is a str object, thus the CPython _csv.so module is happy.

But what happens when the CPython _csv.so module tries to write the row data? It invokes the write method of the utf-8 StreamWriter class, which encodes to utf-8. One would expect it would only encode unicode objects and would pass str objects through unmodified (on the assumption a str object has already been utf-8 encoded). But it actually attempts to encode every object. Therefore when it receives a utf-8 encoded str object it tries to encode it again; but to be encoded the object first has to be unicode, so it is promoted to unicode using the default encoding of ASCII, which of course fails, and hence the codec error.
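The double-encoding failure John describes can be reproduced in a few lines. This is a Python 3 sketch of the Py2 mechanism: the implicit str-to-unicode promotion Py2 performs is written out here as the explicit ASCII decode it amounts to.

```python
# unicodecsv hands the stream a utf-8 encoded byte string:
utf8_bytes = u"\u00e9\u00e8\u00eb".encode("utf-8")  # b'\xc3\xa9\xc3\xa8\xc3\xab'

# The StreamWriter tries to encode it again; in Py2 that forces an
# implicit promotion to unicode via the default ASCII codec, which is
# the decode step shown explicitly here:
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    # 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
    print(exc)
```

This is the same error the bug report shows, just with a different byte position.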

At this point one might ask: is it a mistake to substitute a utf-8 StreamWriter for stdout? If we don't do that then we'll get codec errors whenever unicode strings containing non-ascii characters are emitted. Avoiding the substitution would require finding *every* code location that emits unicode and adding .encode('utf-8'), which for obvious reasons we don't want to do.

I want to ponder this a bit more but I think the solution is to keep the unicodecsv solution and the StreamWriter but use a StreamWriter that only encodes unicode objects and allows str objects to pass through.

It's a shame str objects are not tagged with their encoding; then a StreamWriter could know whether a str object in some other encoding needs to be re-encoded into utf-8. But they aren't, so we just have to assume a str object is already properly encoded in utf-8.
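The proposed writer can be sketched as follows. This is an illustration of the idea, not cliff's actual patch; the names and the binary underlying stream are assumptions for the example.

```python
import codecs
import io

def getwriter(encoding):
    # Return a StreamWriter subclass that encodes text objects but lets
    # already-encoded byte strings pass through unmodified.
    base = codecs.getwriter(encoding)

    class PassThroughStreamWriter(base):
        def write(self, obj):
            if isinstance(obj, bytes):          # already encoded: pass through
                self.stream.write(obj)
            else:                               # text: encode as usual
                base.write(self, obj)

    return PassThroughStreamWriter

# Mixed text and pre-encoded bytes no longer trip the codec:
buf = io.BytesIO()
writer = getwriter("utf-8")(buf)
writer.write(u"\u00e9")                   # text, gets encoded
writer.write(u"\u00e9".encode("utf-8"))   # bytes, passed through
print(buf.getvalue())                     # b'\xc3\xa9\xc3\xa9'
```

With the stock codecs.getwriter, the second write is the one that raises the decode error.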

BTW, as you can see this is a GIANT MESS in Py2. Cleaning this nonsense up was the driving force behind Py3, where *all* text is represented as unicode objects and text encoding properly occurs at I/O boundaries instead of at random programmer-inserted places.

Revision history for this message
John Dennis (jdennis-a) wrote :

Oh, I'll point out there is one other possible solution we might consider.

Make the default encoding utf-8 instead of ASCII.

This requires a hacky trick that looks like this:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

The trick exists because site.py explicitly removes sys.setdefaultencoding(): Python does not want you to call it. But if you reload sys, setdefaultencoding reappears as a callable function.

This would have to be done at the very beginning of the openstack client app before anything else is loaded.

But there are very good reasons not to do this. It would only work for code executed by osc; anybody else loading the code would be out of luck. It's considered dangerous because of the way Py2 implements strings: it caches a default-encoded version of each unicode object. And (potentially) some code truly depends on ASCII (this is a hard thing to know except by exhaustively testing every bit of Python code that gets loaded).

The general recommendation is to avoid the trick.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote :

Thank you for the very detailed debugging report. I like the idea of using a StreamWriter that only encodes unicode objects and does not try to re-encode str/bytes objects.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cliff (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/508352

Revision history for this message
Oliver Walsh (owalsh) wrote :

I don't think a smarter StreamWriter would work in this case. That assumes everyone agrees on the encoding, but unicodecsv defaults to utf-8. If the current locale is not utf-8 we would have issues.

As the (stdlib) csv module expects a file I think it's reasonable for unicodecsv to assume it's at a boundary and encode the output.

The workaround I've proposed is the least evil solution I could think of... one that I believe would also work correctly with non-utf-8 locales:

(gnome terminal set to EUC-JP)
$ export LC_CTYPE=japanese.euc
$ openstack project create 外字
$ openstack project list --format csv
"ID","Name"
"757eba2e8346434ca008ed1bb94d4c42","admin"
"8ded8b7ef77f4a09afb8c62414ec15e5","service"
"bc7e7ab3e5054491beb6924da23301a7","外字"

I'd also argue for using errors='replace' in cliff instead of the default errors='strict', as it's a CLI. This would be kinder to non-unicode locales: e.g. if a utf-8 locale was used to input a string that cannot be represented in the current locale, then at least the uuid would be output, and it could then be used to alter the string via the set command.
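The difference between the two error modes, for text the locale cannot represent (a standalone example, not cliff code):

```python
text = u"\u5916\u5b57"  # 外字, not representable in ASCII

# errors='strict' (the default) aborts the whole command:
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc)

# errors='replace' degrades gracefully, keeping the rest of the row
# (including the uuid) readable:
print("replace:", text.encode("ascii", errors="replace"))  # b'??'
```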

Revision history for this message
Julie Pichon (jpichon) wrote :

Many thanks for the detailed comments.

Revision history for this message
John Dennis (jdennis-a) wrote :

Thank you Oliver for the proposed patch and thoughtful comments. You
hit upon an important issue not discussed previously: maintaining the
requested encoding when invoking the unicodecsv module. Let's go
through some concerns with the proposed patch:

1. Make unicodecsv respect the streams encoding instead of hardcoding
it to utf-8.

This is an excellent observation and we should do everything possible
to assure the same encoding is applied everywhere consistently.

However your method of reaching into the StreamWriter implementation
does not get you the encoding of the StreamWriter, rather it gives you
the encoding of the stream it's wrapping, the two are not the
same. This can easily be seen with the below snippet where the
StreamWriter is encoding to EUC-JP but you will have incorrectly
obtained the UTF-8 encoding.

>>> import sys
>>> import codecs
>>> sys.stdout.encoding
'UTF-8'
>>> s = codecs.getwriter('EUC-JP')(sys.stdout)
>>> s.stream.encoding
'UTF-8'
>>>

2. The fix is local.

By far and away the biggest problem with i18n and encoding issues in
Py2 is the tendency to put band-aids over local problems as opposed to
considering the behavior of the entire system. The accumulation of
inconsistent band-aids leads to a fragile implementation that tends to
break when new band-aids get applied elsewhere. It's better to try and
fix things as close to the root as possible. OpenStack is a very large
collection of diverse code, who else may be generating encoded str
objects that will also run afoul of the StreamWriter codec error?

3. The csv file-like parameter is not at an I/O boundary.

I think you might have confused the use of a file-like object in
csv with being at an I/O boundary. In Python, if you need a stream
conversion filter, the way you implement it is with an object exporting
the file-like operations of read, write, etc. Consider passing a
StringIO object to csv so that the csv output is written to a string:
strings do not exist at an I/O boundary. csv is simply reading one
piece of data, converting it and then writing it in a converted
format. This is filtering; it might occur at an I/O boundary or it
might not (writing to a string being the classic example).

4. Make the fix "global"

We want consistent behavior for all components, not just for csv (see
issue #2).

After I debugged this yesterday I googled around and discovered we're
not the first ones to discover the issue with the StreamWriter.

Take a look at this doc:

https://github.com/fedora-infra/kitchen/blob/develop/kitchen2/docs/unicode-frustrations.rst#frustration-4-now-it-doesnt-take-byte-strings

The entire page has a lot of good observations and is worth a read but
the salient point is they make the exact same recommendation I
did. Provide a different implementation of codecs.getwriter() which
does not attempt to encode str objects. They also provide an
implementation of such a function, although for our purposes it could
be greatly simplified, I don't think we need all the generality their
library offers.

5. Get the encoding associated with the StreamWriter

This goes back to issue #1. A very cursory examination on my part
suggests there is no ...


Revision history for this message
Oliver Walsh (owalsh) wrote :

Hey John,

> 1. Make unicodecsv respect the streams encoding instead of hardcoding
it to utf-8.
> rather it gives you the encoding of the stream it's wrapping,

Yea, I didn't realise this was passed through to the wrapped stream in __getattr__. Should be s.encoding in the code snippet FWIW.

> 2. The fix is local.

Ack, been there with other code bases.

> who else may be generating encoded str
objects that will also run afoul of the StreamWriter codec error

Hopefully py3 support is at the point where things like this have been caught already. Of course if this is done then it's probably within a py2 only conditional block.

> 3. The csv file-like parameter is not at an I/O boundary.

Ack, it's unicodecsv that assumes it is. Maybe worth looking at backports.csv as an alternative.

> 4. Make the fix "global"

+1 but I don't see how we can avoid a local fix for unicodecsv.

> 5. Get the encoding associated with the StreamWriter

Ack, it has the codec object so I guess we could search all codecs until we find a match... but urrgh.

> implement our own getwriter function it would be trivial
to add the encoding name to the returned object.

+1

> 6. Proposed patch does not have any code comments

Yup, still a WIP. Just wanted to share what I had working & the test to reproduce the issue.

Cheers,
Ollie

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cliff (master)

Fix proposed to branch: master
Review: https://review.openstack.org/508760

Changed in python-cliff:
assignee: nobody → John Dennis (jdennis-a)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cliff (master)

Change abandoned by Oliver Walsh (<email address hidden>) on branch: master
Review: https://review.openstack.org/508352
Reason: see https://review.openstack.org/508760

Julie Pichon (jpichon)
tags: added: newton-backport-potential ocata-backport-potential pike-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cliff (master)

Reviewed: https://review.openstack.org/508760
Committed: https://git.openstack.org/cgit/openstack/cliff/commit/?id=c61cc30060ca56257ca3504153578e02e68e7f0a
Submitter: Zuul
Branch: master

commit c61cc30060ca56257ca3504153578e02e68e7f0a
Author: John Dennis <email address hidden>
Date: Sun Oct 1 11:03:08 2017 -0400

    Fix codec error when format=csv

    The Py2 version of the csv module cannot accept unicode text. In Py2
    we replaced the csv module with the unicodecsv module which encodes
    unicode text prior to calling the standard Py2 csv module. Thus when
    the cvs formatted output is emitted on stdout it is presented as a
    encoded byte stream. But stdout has been replaced with a StreamWriter
    which encodes to the desired encoding. The problem is the StreamWriter
    attempts to encode all objects passed to it's write function,
    including str objects. Instead it should only encode unicode text objects
    and allow bytes to pass through unmodified.

    This patch adds an override of the codecs.getwriter function which
    only encodes unicode text objects. In addtion we pass the encoding
    value obtained from the stream to the unicodecsv writer.

    The patch fixes the codec error when outputing csv formated text that
    contains a non-ASCII character. The unicodecsv implmentation will emit
    byte encoded str objects to the stream. When the core StreamWriter
    attempts to encode a str object Python will first promote the str
    object to a unicode object. The promotion of str to unicode requires
    the str bytes to be decoded. However the encoding associated with the
    str object is not known therefore Python applies the default-encoding
    which is ASCII. In the case where the str object contains utf-8
    encoded non-ASCII characters a decoding error is raised. By not
    attempting to encode a byte stream we avoid this error.

    A more complete discussion of the above issues can be found here:

    https://github.com/fedora-infra/kitchen/blob/develop/kitchen2/docs/unicode-frustrations.rst#frustration-4-now-it-doesnt-take-byte-strings

    Change-Id: I22b5ad8bf0e227ec75a2a36986f0487191f7cbc2
    Closes-Bug: 1720115
    Signed-off-by: John Dennis <email address hidden>

Changed in python-cliff:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cliff (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/526056

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cliff (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/526061

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cliff (stable/pike)

Reviewed: https://review.openstack.org/526056
Committed: https://git.openstack.org/cgit/openstack/cliff/commit/?id=35f3820855413b4cd1065a00154efb82479cd328
Submitter: Zuul
Branch: stable/pike

commit 35f3820855413b4cd1065a00154efb82479cd328
Author: John Dennis <email address hidden>
Date: Sun Oct 1 11:03:08 2017 -0400

    Fix codec error when format=csv

    The Py2 version of the csv module cannot accept unicode text. In Py2
    we replaced the csv module with the unicodecsv module which encodes
    unicode text prior to calling the standard Py2 csv module. Thus when
    the cvs formatted output is emitted on stdout it is presented as a
    encoded byte stream. But stdout has been replaced with a StreamWriter
    which encodes to the desired encoding. The problem is the StreamWriter
    attempts to encode all objects passed to it's write function,
    including str objects. Instead it should only encode unicode text objects
    and allow bytes to pass through unmodified.

    This patch adds an override of the codecs.getwriter function which
    only encodes unicode text objects. In addtion we pass the encoding
    value obtained from the stream to the unicodecsv writer.

    The patch fixes the codec error when outputing csv formated text that
    contains a non-ASCII character. The unicodecsv implmentation will emit
    byte encoded str objects to the stream. When the core StreamWriter
    attempts to encode a str object Python will first promote the str
    object to a unicode object. The promotion of str to unicode requires
    the str bytes to be decoded. However the encoding associated with the
    str object is not known therefore Python applies the default-encoding
    which is ASCII. In the case where the str object contains utf-8
    encoded non-ASCII characters a decoding error is raised. By not
    attempting to encode a byte stream we avoid this error.

    A more complete discussion of the above issues can be found here:

    https://github.com/fedora-infra/kitchen/blob/develop/kitchen2/docs/unicode-frustrations.rst#frustration-4-now-it-doesnt-take-byte-strings

    Change-Id: I22b5ad8bf0e227ec75a2a36986f0487191f7cbc2
    Closes-Bug: 1720115
    Signed-off-by: John Dennis <email address hidden>
    (cherry picked from commit c61cc30060ca56257ca3504153578e02e68e7f0a)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cliff 2.10.0

This issue was fixed in the openstack/cliff 2.10.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cliff (stable/ocata)

Reviewed: https://review.openstack.org/526061
Committed: https://git.openstack.org/cgit/openstack/cliff/commit/?id=132c948aed64c8c7154d834c004a1c0e490179bb
Submitter: Zuul
Branch: stable/ocata

commit 132c948aed64c8c7154d834c004a1c0e490179bb
Author: John Dennis <email address hidden>
Date: Sun Oct 1 11:03:08 2017 -0400

    Fix codec error when format=csv

    The Py2 version of the csv module cannot accept unicode text. In Py2
    we replaced the csv module with the unicodecsv module which encodes
    unicode text prior to calling the standard Py2 csv module. Thus when
    the cvs formatted output is emitted on stdout it is presented as a
    encoded byte stream. But stdout has been replaced with a StreamWriter
    which encodes to the desired encoding. The problem is the StreamWriter
    attempts to encode all objects passed to it's write function,
    including str objects. Instead it should only encode unicode text objects
    and allow bytes to pass through unmodified.

    This patch adds an override of the codecs.getwriter function which
    only encodes unicode text objects. In addtion we pass the encoding
    value obtained from the stream to the unicodecsv writer.

    The patch fixes the codec error when outputing csv formated text that
    contains a non-ASCII character. The unicodecsv implmentation will emit
    byte encoded str objects to the stream. When the core StreamWriter
    attempts to encode a str object Python will first promote the str
    object to a unicode object. The promotion of str to unicode requires
    the str bytes to be decoded. However the encoding associated with the
    str object is not known therefore Python applies the default-encoding
    which is ASCII. In the case where the str object contains utf-8
    encoded non-ASCII characters a decoding error is raised. By not
    attempting to encode a byte stream we avoid this error.

    A more complete discussion of the above issues can be found here:

    https://github.com/fedora-infra/kitchen/blob/develop/kitchen2/docs/unicode-frustrations.rst#frustration-4-now-it-doesnt-take-byte-strings

    Conflicts:
        cliff/tests/test_app.py

    Change-Id: I22b5ad8bf0e227ec75a2a36986f0487191f7cbc2
    Closes-Bug: 1720115
    Signed-off-by: John Dennis <email address hidden>
    (cherry picked from commit c61cc30060ca56257ca3504153578e02e68e7f0a)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cliff 2.8.1

This issue was fixed in the openstack/cliff 2.8.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cliff 2.4.1

This issue was fixed in the openstack/cliff 2.4.1 release.
