CSV formatter not working with unicode

Bug #1720115 reported by Julie Pichon
Affects Status Importance Assigned to Milestone
cliff
Fix Released
Undecided
John Dennis

Bug Description

Testing on a recent devstack, it looks like the CSV formatter no longer supports unicode:

$ openstack project create éèë
$ openstack project list --format csv
"ID","Name"
"21936b57efce4c5e87d402bc9fbfed44","alt_demo"
"5a6f97ac785d4150bf33e78ae591a444","invisible_to_admin"
"5e329abfc39c4fc5ac13b1e0765d9222","admin"
"6d4639dfdad34977887f2a3fc45a62aa","service"
'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)

I understand this should be fixed as of bug 1481014 but it doesn't seem to be the case. I could get it to work on an older environment (Mitaka with cliff 2.0.0), but not from Newton.

From what I can tell it is due to the fix for bug 1603210 [1], which was backported. When I comment out L121 [2] which was brought in as part of that patch, the csv formatter starts working again. I think perhaps we're trying to encode a string that unicodecsv already encoded?

John Dennis kindly took a look and suggested unicodecsv should no longer be necessary now that we encode properly at the stream level, though unfortunately a quick manual test suggests that won't be enough on its own.

[1] https://review.openstack.org/#/c/342914/
[2] https://github.com/openstack/cliff/blob/f2c381/cliff/app.py#L120-L121

Revision history for this message
Julie Pichon (jpichon) wrote :

With unicodecsv:
$ openstack project list --format=csv
[...]
'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)

After removing it:
$ openstack project list --format=csv
[...]
'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Revision history for this message
Julie Pichon (jpichon) wrote :

Traceback for the original error, if that helps:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/display.py", line 115, in run
    self.produce_output(parsed_args, column_names, data)
  File "/usr/lib/python2.7/site-packages/cliff/lister.py", line 53, in produce_output
    parsed_args,
  File "/usr/lib/python2.7/site-packages/cliff/formatters/commaseparated.py", line 62, in emit_list
    for c in row]
  File "/usr/lib/python2.7/site-packages/unicodecsv/py2.py", line 86, in writerow
    _stringify_list(row, self.encoding, self.encoding_errors))
  File "/usr/lib64/python2.7/codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)

https://github.com/openstack/cliff/blob/f2c381/cliff/formatters/commaseparated.py#L49-L63

(When removing unicodecsv, the failure happens earlier, directly in emit_list:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
    result = cmd.run(parsed_args)
  File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python2.7/site-packages/cliff/display.py", line 115, in run
    self.produce_output(parsed_args, column_names, data)
  File "/usr/lib/python2.7/site-packages/cliff/lister.py", line 53, in produce_output
    parsed_args,
  File "/usr/lib/python2.7/site-packages/cliff/formatters/commaseparated.py", line 62, in emit_list
    for c in row]
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128))

Revision history for this message
John Dennis (jdennis-a) wrote :

I've debugged what is occurring and will capture the information here. We'll have to figure out a solution later.

The standard csv module in Py2 cannot accept unicode strings, which in and of itself is very odd. The fundamental reason for this is that a portion of the module is written in C, and at the moment I can't explain the necessity of implementing part of it in C, but be that as it may ...

The reason why csv cannot accept unicode during writes can be seen in Modules/_csv.c in the function csv_writerow(). It does a PyString_Check(field) on line 1182, which fails for a unicode string (because PyString_Check only returns true for str objects), and then later tries to convert the object to a str by calling PyObject_Str(field) on line 1198. PyObject_Str(field) is equivalent to str(field) in Python. To convert a unicode object to str, the default encoding of ASCII is applied, which of course fails whenever the unicode string contains non-ascii characters. The _csv C module could be made unicode aware, and it wouldn't be hard, but it isn't, so not much we can do there.

Therefore, to get around the problem of the CPython _csv.so module being unable to accept unicode, a new module called unicodecsv was introduced. unicodecsv encodes all unicode objects to utf-8; the result of the utf-8 encoding is a str object, thus the CPython _csv.so module is happy.

But what happens when the CPython _csv.so module tries to write the row data? It invokes the write method of the utf-8 StreamWriter class, which encodes to utf-8. One would expect it would only encode unicode objects and would pass str objects through unmodified (on the assumption a str object has already been utf-8 encoded). But it actually attempts to encode every object. Therefore when it receives a utf-8 encoded str object it tries to encode it again; but to be encoded the object first has to be unicode, so it is promoted to unicode using the default encoding of ASCII, which of course fails, and hence the codec error.
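The double-encoding failure John describes can be reproduced in a few lines. This is a Python 3 sketch of the Py2 mechanism: the implicit str-to-unicode promotion Py2 performs is written out here as the explicit ASCII decode it amounts to.

```python
# unicodecsv hands the stream a utf-8 encoded byte string:
utf8_bytes = u"\u00e9\u00e8\u00eb".encode("utf-8")  # b'\xc3\xa9\xc3\xa8\xc3\xab'

# The StreamWriter tries to encode it again; in Py2 that forces an
# implicit promotion to unicode via the default ASCII codec, which is
# the decode step shown explicitly here:
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    # 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
    print(exc)
```

This is the same error the bug report shows, just with a different byte position.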

At this point one might ask: is it a mistake to substitute a utf-8 StreamWriter for stdout? If we don't do that then we'll get codec errors whenever unicode strings containing non-ascii characters are emitted. Avoiding the substitution would require finding *every* code location that emits unicode and adding .encode('utf-8'), which for obvious reasons we don't want to do.

I want to ponder this a bit more but I think the solution is to keep the unicodecsv solution and the StreamWriter but use a StreamWriter that only encodes unicode objects and allows str objects to pass through.

It's a shame str objects are not tagged with their encoding; then a StreamWriter could know whether a str object in some other encoding needs to be re-encoded into utf-8. But they aren't, so we just have to assume a str object is already properly encoded in utf-8.
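The proposed writer can be sketched as follows. This is an illustration of the idea, not cliff's actual patch; the names and the binary underlying stream are assumptions for the example.

```python
import codecs
import io

def getwriter(encoding):
    # Return a StreamWriter subclass that encodes text objects but lets
    # already-encoded byte strings pass through unmodified.
    base = codecs.getwriter(encoding)

    class PassThroughStreamWriter(base):
        def write(self, obj):
            if isinstance(obj, bytes):          # already encoded: pass through
                self.stream.write(obj)
            else:                               # text: encode as usual
                base.write(self, obj)

    return PassThroughStreamWriter

# Mixed text and pre-encoded bytes no longer trip the codec:
buf = io.BytesIO()
writer = getwriter("utf-8")(buf)
writer.write(u"\u00e9")                   # text, gets encoded
writer.write(u"\u00e9".encode("utf-8"))   # bytes, passed through
print(buf.getvalue())                     # b'\xc3\xa9\xc3\xa9'
```

With the stock codecs.getwriter, the second write is the one that raises the decode error.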

BTW, as you can see this is a GIANT MESS in Py2. Cleaning this nonsense up was the driving force behind Py3, where *all* text is represented as unicode objects and text encoding properly occurs at I/O boundaries instead of at random programmer-inserted places.

Revision history for this message
John Dennis (jdennis-a) wrote :

Oh, I'll point out there is one other possible solution we might consider.

Make the default encoding utf-8 instead of ASCII.

This requires a hacky trick that looks like this:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

The trick exists because site.py explicitly removes sys.setdefaultencoding(): Python does not want you to call it. But if you reload sys, setdefaultencoding reappears as a callable function.

This would have to be done at the very beginning of the openstack client app before anything else is loaded.

But there are very good reasons not to do this. It would only work for code executed by osc; anybody else loading the code would be out of luck. It's considered dangerous because of the way Py2 implements strings: it caches a default-encoded version of each unicode object. And (potentially) some code truly depends on ASCII (this is a hard thing to know except by exhaustively testing every bit of Python code that gets loaded).

The general recommendation is to avoid the trick.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote :

Thank you for the very detailed debugging report. I like the idea of using a StreamWriter that only encodes unicode objects and does not try to re-encode str/bytes objects.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to cliff (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/508352

Revision history for this message
Oliver Walsh (owalsh) wrote :

I don't think a smarter StreamWriter would work in this case. That assumes everyone agrees on the encoding, but unicodecsv defaults to utf-8. If the current locale is not utf-8 we would have issues.

As the (stdlib) csv module expects a file I think it's reasonable for unicodecsv to assume it's at a boundary and encode the output.

The workaround I've proposed is the least evil solution I could think of... one that I believe would also work correctly with non-utf-8 locales:

(gnome terminal set to EUC-JP)
$ export LC_CTYPE=japanese.euc
$ openstack project create 外字
$ openstack project list --format csv
"ID","Name"
"757eba2e8346434ca008ed1bb94d4c42","admin"
"8ded8b7ef77f4a09afb8c62414ec15e5","service"
"bc7e7ab3e5054491beb6924da23301a7","外字"

I'd also argue for using errors='replace' in cliff instead of the default errors='strict', as it's a CLI. This would be kinder to non-unicode locales: e.g. if a utf-8 locale was used to input a string that cannot be represented in the current locale, then at least the uuid would be output, and it could then be used to alter the string via the set command.
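The difference between the two error modes, for text the locale cannot represent (a standalone example, not cliff code):

```python
text = u"\u5916\u5b57"  # 外字, not representable in ASCII

# errors='strict' (the default) aborts the whole command:
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc)

# errors='replace' degrades gracefully, keeping the rest of the row
# (including the uuid) readable:
print("replace:", text.encode("ascii", errors="replace"))  # b'??'
```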

Revision history for this message
Julie Pichon (jpichon) wrote :

Many thanks for the detailed comments.

Revision history for this message
John Dennis (jdennis-a) wrote :

Thank you Oliver for the proposed patch and thoughtful comments. You
hit upon an important issue not discussed previously: maintaining the
requested encoding when invoking the unicodecsv module. Let's go
through some concerns with the proposed patch:

1. Make unicodecsv respect the streams encoding instead of hardcoding
it to utf-8.

This is an excellent observation and we should do everything possible
to assure the same encoding is applied everywhere consistently.

However your method of reaching into the StreamWriter implementation
does not get you the encoding of the StreamWriter, rather it gives you
the encoding of the stream it's wrapping, the two are not the
same. This can easily be seen with the below snippet where the
StreamWriter is encoding to EUC-JP but you will have incorrectly
obtained the UTF-8 encoding.

>>> import sys
>>> import codecs
>>> sys.stdout.encoding
'UTF-8'
>>> s = codecs.getwriter('EUC-JP')(sys.stdout)
>>> s.stream.encoding
'UTF-8'
>>>

2. The fix is local.

By far and away the biggest problem with i18n and encoding issues in
Py2 is the tendency to put band-aids over local problems as opposed to
considering the behavior of the entire system. The accumulation of
inconsistent band-aids leads to a fragile implementation that tends to
break when new band-aids get applied elsewhere. It's better to try and
fix things as close to the root as possible. OpenStack is a very large
collection of diverse code, who else may be generating encoded str
objects that will also run afoul of the StreamWriter codec error?

3. The csv file-like parameter is not at an I/O boundary.

I think you might have confused the use of a file-like object in
csv with being at an I/O boundary. In Python, if you need a stream
conversion filter, the way you implement it is with an object exporting
the file-like operations of read, write, etc. Consider passing a
StringIO object to csv so that the csv output is written to a string:
strings do not exist at an I/O boundary. csv is simply reading one
piece of data, converting it and then writing it in a converted
format. This is filtering; it might occur at an I/O boundary or it
might not (writing to a string being the classic example).

4. Make the fix "global"

We want consistent behavior for all components, not just for csv (see
issue #2).

After I debugged this yesterday I googled around and discovered we're
not the first ones to discover the issue with the StreamWriter.

Take a look at this doc:

https://github.com/fedora-infra/kitchen/blob/develop/kitchen2/docs/unicode-frustrations.rst#frustration-4-now-it-doesnt-take-byte-strings

The entire page has a lot of good observations and is worth a read but
the salient point is they make the exact same recommendation I
did. Provide a different implementation of codecs.getwriter() which
does not attempt to encode str objects. They also provide an
implementation of such a function, although for our purposes it could
be greatly simplified, I don't think we need all the generality their
library offers.

5. Get the encoding associated with the StreamWriter

This goes back to issue #1. A very cursory examination on my part
suggests there is no ...


Revision history for this message
Oliver Walsh (owalsh) wrote :

Hey John,

> 1. Make unicodecsv respect the streams encoding instead of hardcoding
it to utf-8.
> rather it gives you the encoding of the stream it's wrapping,

Yea, I didn't realise this was passed through to the wrapped stream in __getattr__. Should be s.encoding in the code snippet FWIW.

> 2. The fix is local.

Ack, been there with other code bases.

> who else may be generating encoded str
objects that will also run afoul of the StreamWriter codec error

Hopefully py3 support is at the point where things like this have been caught already. Of course if this is done then it's probably within a py2 only conditional block.

> 3. The csv file-like parameter is not at an I/O boundary.

Ack, it's unicodecsv that assumes it is. Maybe worth looking at backports.csv as an alternative.

> 4. Make the fix "global"

+1 but I don't see how we can avoid a local fix for unicodecsv.

> 5. Get the encoding associated with the StreamWriter

Ack, it has the codec object so I guess we could search all codecs until we find a match... but urrgh.

> implement our own getwriter function it would be trivial
to add the encoding name to the returned object.

+1

> 6. Proposed patch does not have any code comments

Yup, still a WIP. Just wanted to share what I had working & the test to reproduce the issue.

Cheers,
Ollie

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cliff (master)

Fix proposed to branch: master
Review: https://review.openstack.org/508760

Changed in python-cliff:
assignee: nobody → John Dennis (jdennis-a)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cliff (master)

Change abandoned by Oliver Walsh (<email address hidden>) on branch: master
Review: https://review.openstack.org/508352
Reason: see https://review.openstack.org/508760

Julie Pichon (jpichon)
tags: added: newton-backport-potential ocata-backport-potential pike-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cliff (master)

Reviewed: https://review.openstack.org/508760
Committed: https://git.openstack.org/cgit/openstack/cliff/commit/?id=c61cc30060ca56257ca3504153578e02e68e7f0a
Submitter: Zuul
Branch: master

commit c61cc30060ca56257ca3504153578e02e68e7f0a
Author: John Dennis <email address hidden>
Date: Sun Oct 1 11:03:08 2017 -0400

    Fix codec error when format=csv

    The Py2 version of the csv module cannot accept unicode text. In Py2
    we replaced the csv module with the unicodecsv module which encodes
    unicode text prior to calling the standard Py2 csv module. Thus when
    the cvs formatted output is emitted on stdout it is presented as a
    encoded byte stream. But stdout has been replaced with a StreamWriter
    which encodes to the desired encoding. The problem is the StreamWriter
    attempts to encode all objects passed to it's write function,
    including str objects. Instead it should only encode unicode text objects
    and allow bytes to pass through unmodified.

    This patch adds an override of the codecs.getwriter function which
    only encodes unicode text objects. In addtion we pass the encoding
    value obtained from the stream to the unicodecsv writer.

    The patch fixes the codec error when outputing csv formated text that
    contains a non-ASCII character. The unicodecsv implmentation will emit
    byte encoded str objects to the stream. When the core StreamWriter
    attempts to encode a str object Python will first promote the str
    object to a unicode object. The promotion of str to unicode requires
    the str bytes to be decoded. However the encoding associated with the
    str object is not known therefore Python applies the default-encoding
    which is ASCII. In the case where the str object contains utf-8
    encoded non-ASCII characters a decoding error is raised. By not
    attempting to encode a byte stream we avoid this error.

    A more complete discussion of the above issues can be found here:

    https://github.com/fedora-infra/kitchen/blob/develop/kitchen2/docs/unicode-frustrations.rst#frustration-4-now-it-doesnt-take-byte-strings

    Change-Id: I22b5ad8bf0e227ec75a2a36986f0487191f7cbc2
    Closes-Bug: 1720115
    Signed-off-by: John Dennis <email address hidden>

Changed in python-cliff:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cliff (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/526056

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cliff (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/526061

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cliff (stable/pike)

Reviewed: https://review.openstack.org/526056
Committed: https://git.openstack.org/cgit/openstack/cliff/commit/?id=35f3820855413b4cd1065a00154efb82479cd328
Submitter: Zuul
Branch: stable/pike

commit 35f3820855413b4cd1065a00154efb82479cd328
Author: John Dennis <email address hidden>
Date: Sun Oct 1 11:03:08 2017 -0400

    Fix codec error when format=csv

    The Py2 version of the csv module cannot accept unicode text. In Py2
    we replaced the csv module with the unicodecsv module which encodes
    unicode text prior to calling the standard Py2 csv module. Thus when
    the cvs formatted output is emitted on stdout it is presented as a
    encoded byte stream. But stdout has been replaced with a StreamWriter
    which encodes to the desired encoding. The problem is the StreamWriter
    attempts to encode all objects passed to it's write function,
    including str objects. Instead it should only encode unicode text objects
    and allow bytes to pass through unmodified.

    This patch adds an override of the codecs.getwriter function which
    only encodes unicode text objects. In addtion we pass the encoding
    value obtained from the stream to the unicodecsv writer.

    The patch fixes the codec error when outputing csv formated text that
    contains a non-ASCII character. The unicodecsv implmentation will emit
    byte encoded str objects to the stream. When the core StreamWriter
    attempts to encode a str object Python will first promote the str
    object to a unicode object. The promotion of str to unicode requires
    the str bytes to be decoded. However the encoding associated with the
    str object is not known therefore Python applies the default-encoding
    which is ASCII. In the case where the str object contains utf-8
    encoded non-ASCII characters a decoding error is raised. By not
    attempting to encode a byte stream we avoid this error.

    A more complete discussion of the above issues can be found here:

    https://github.com/fedora-infra/kitchen/blob/develop/kitchen2/docs/unicode-frustrations.rst#frustration-4-now-it-doesnt-take-byte-strings

    Change-Id: I22b5ad8bf0e227ec75a2a36986f0487191f7cbc2
    Closes-Bug: 1720115
    Signed-off-by: John Dennis <email address hidden>
    (cherry picked from commit c61cc30060ca56257ca3504153578e02e68e7f0a)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cliff 2.10.0

This issue was fixed in the openstack/cliff 2.10.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cliff (stable/ocata)

Reviewed: https://review.openstack.org/526061
Committed: https://git.openstack.org/cgit/openstack/cliff/commit/?id=132c948aed64c8c7154d834c004a1c0e490179bb
Submitter: Zuul
Branch: stable/ocata

commit 132c948aed64c8c7154d834c004a1c0e490179bb
Author: John Dennis <email address hidden>
Date: Sun Oct 1 11:03:08 2017 -0400

    Fix codec error when format=csv

    The Py2 version of the csv module cannot accept unicode text. In Py2
    we replaced the csv module with the unicodecsv module which encodes
    unicode text prior to calling the standard Py2 csv module. Thus when
    the cvs formatted output is emitted on stdout it is presented as a
    encoded byte stream. But stdout has been replaced with a StreamWriter
    which encodes to the desired encoding. The problem is the StreamWriter
    attempts to encode all objects passed to it's write function,
    including str objects. Instead it should only encode unicode text objects
    and allow bytes to pass through unmodified.

    This patch adds an override of the codecs.getwriter function which
    only encodes unicode text objects. In addtion we pass the encoding
    value obtained from the stream to the unicodecsv writer.

    The patch fixes the codec error when outputing csv formated text that
    contains a non-ASCII character. The unicodecsv implmentation will emit
    byte encoded str objects to the stream. When the core StreamWriter
    attempts to encode a str object Python will first promote the str
    object to a unicode object. The promotion of str to unicode requires
    the str bytes to be decoded. However the encoding associated with the
    str object is not known therefore Python applies the default-encoding
    which is ASCII. In the case where the str object contains utf-8
    encoded non-ASCII characters a decoding error is raised. By not
    attempting to encode a byte stream we avoid this error.

    A more complete discussion of the above issues can be found here:

    https://github.com/fedora-infra/kitchen/blob/develop/kitchen2/docs/unicode-frustrations.rst#frustration-4-now-it-doesnt-take-byte-strings

    Conflicts:
        cliff/tests/test_app.py

    Change-Id: I22b5ad8bf0e227ec75a2a36986f0487191f7cbc2
    Closes-Bug: 1720115
    Signed-off-by: John Dennis <email address hidden>
    (cherry picked from commit c61cc30060ca56257ca3504153578e02e68e7f0a)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cliff 2.8.1

This issue was fixed in the openstack/cliff 2.8.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cliff 2.4.1

This issue was fixed in the openstack/cliff 2.4.1 release.
