Images API v2 utf-8 tags returned as unicode

Bug #1045455 reported by Brian Waldon
This bug affects 1 person
Affects   Status         Importance   Assigned to       Milestone
Glance    Fix Released   Medium       Eddie Sheffield

Bug Description

If I update an image with tag '™', the tag is added and it is deletable, but it is presented as a unicode character: '\u2122'. I would expect to see it as '\xe2\x84\xa2'.

vagrant@precise:~/devstack$ curl -X PUT -H 'content-type: application/json' -d '{"tags":["™"]}' -i -H 'x-auth-token: 2761f282515c4e9d9370cafabf73dfea' http://localhost:9292/v2/images/5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2
HTTP/1.1 200 OK
Content-Length: 545
Content-Type: application/json; charset=UTF-8
X-Openstack-Request-Id: req-be760442-01e0-4148-9b40-d34b4dd680ca
Date: Mon, 03 Sep 2012 18:34:34 GMT

{"status": "active", "name": "cirros-0.3.0-x86_64-uec-kernel", "tags": ["\u2122"], "container_format": "aki", "created_at": "2012-08-30T23:41:04Z", "disk_format": "aki", "updated_at": "2012-09-03T18:34:34Z", "visibility": "public", "id": "5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2", "protected": false, "min_ram": 0, "file": "/v2/images/5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2/file", "checksum": "cfb203e7267a28e435dbcb05af5910a9", "min_disk": 0, "size": 4731440, "self": "/v2/images/5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2", "schema": "/v2/schemas/image"}

Changed in glance:
assignee: nobody → Eddie Sheffield (eddie-sheffield)
Changed in glance:
status: Triaged → In Progress
Brian Waldon (bcwaldon) wrote :

This probably applies to all image properties as well, not just tags.

Eddie Sheffield (eddie-sheffield) wrote :

This isn't really a bug - the UTF-8 encoding comes into play in how the actual response body string is encoded "on the wire," not how the string values are displayed. The "\u2122" is the Unicode code point, which is independent of the actual encoding and is more universal. To test this, I ran it through JavaScript with both the code point as currently returned and with the UTF-8 notation:

<script>
var j1 = {"status": "active", "name": "cirros-0.3.0-x86_64-uec-kernel", "tags": ["\u2122"], "container_format": "aki", "created_at": "2012-08-30T23:41:04Z", "disk_format": "aki", "updated_at": "2012-09-03T18:34:34Z", "visibility": "public", "id": "5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2", "protected": false, "min_ram": 0, "file": "/v2/images/5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2/file", "checksum": "cfb203e7267a28e435dbcb05af5910a9", "min_disk": 0, "size": 4731440, "self": "/v2/images/5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2", "schema": "/v2/schemas/image"};

var j2 = {"status": "active", "name": "cirros-0.3.0-x86_64-uec-kernel", "tags": ["\xe2\x84\xa2"], "container_format": "aki", "created_at": "2012-08-30T23:41:04Z", "disk_format": "aki", "updated_at": "2012-09-03T18:34:34Z", "visibility": "public", "id": "5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2", "protected": false, "min_ram": 0, "file": "/v2/images/5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2/file", "checksum": "cfb203e7267a28e435dbcb05af5910a9", "min_disk": 0, "size": 4731440, "self": "/v2/images/5fe4ddf1-4228-4d2d-8c1a-ffe5db9bc4f2", "schema": "/v2/schemas/image"};

window.alert("j1.tags[0]: " + j1.tags[0]);

window.alert("j2.tags[0]: " + j2.tags[0]);

</script>

If you load that up in a web browser, you'll see that the first representation (\u2122) is properly interpreted, while the UTF-8 representation displays garbage.
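
The garbage has a simple explanation: in a JavaScript string literal, each \xHH escape stands for one character U+00HH, so "\xe2\x84\xa2" is three separate characters rather than one UTF-8-encoded trademark sign. Python 3 string literals treat \xHH the same way, so the effect can be sketched there (an illustration only, not code from Glance):

```python
tm = "\u2122"  # the trademark sign as a single code point

# Its UTF-8 encoding is the three bytes e2 84 a2.
utf8_bytes = tm.encode("utf-8")
assert utf8_bytes == b"\xe2\x84\xa2"

# In a text string, each \xHH escape is one character (U+00E2, U+0084, U+00A2).
three_chars = "\xe2\x84\xa2"
print(len(three_chars))  # 3

# That is exactly what you get by misreading the UTF-8 bytes as Latin-1.
assert three_chars == utf8_bytes.decode("latin-1")

# Reversing the misreading recovers the original character.
assert three_chars.encode("latin-1").decode("utf-8") == tm
```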

Changed in glance:
status: In Progress → Invalid
Brian Waldon (bcwaldon) wrote :

Does the browser interpret your utf-8-encoded tag as utf-8? In my example above, the Content-Type is returned with charset utf-8, while \u2122 is NOT utf-8 - it's unicode. If you hack in a static '\xe2\x84\xa2' to the Glance response, you'll see that it can then be correctly interpreted by tools like curl and displayed properly on the command-line.

Changed in glance:
status: Invalid → Triaged
Eddie Sheffield (eddie-sheffield) wrote :

I tried as you suggested and hacked in a static '\xe2\x84\xa2' (added a property to the image just before being turned to JSON in the v2 image serializer). Even using curl, this new property came through as '\u2122'. The confusion seems to be where the content-type comes into play - it defines the encoding of the raw data between the server and the client, not how the client then interprets the unencoded data. By the time JS (or curl, or whatever client) is using the data, the underlying libs have translated that utf-8 data into whatever the internal string representation is (maybe utf-8, or utf-16, or ascii, or ...). At the application level you really shouldn't be seeing UTF-8 - you want unicode because that's the general form of the characters. UTF-8 is largely an underlying implementation detail: as long as the server and client agree on its use as the wire format (via the content-type), the higher level code will just see a string in whatever its native format happens to be.

Also, looking at http://www.json.org/ the '\xe2\x84\xa2' representation is not valid. \xHH is not a valid JSON escape code.
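
Both points can be checked with Python's standard json module. This is a minimal sketch assuming Python 3; the JSON texts are trimmed-down sample data, not full Glance responses:

```python
import json

# ASCII-only JSON text that uses the \u escape.
escaped = '{"tags": ["\\u2122"]}'
# JSON text carrying the character itself, as received in UTF-8 from the wire.
raw = b'{"tags": ["\xe2\x84\xa2"]}'.decode("utf-8")

# Both parse to the same one-character tag.
assert json.loads(escaped) == json.loads(raw) == {"tags": ["\u2122"]}

# \xHH is not among the escapes JSON defines, so the parser rejects it.
invalid = '{"tags": ["\\xe2\\x84\\xa2"]}'
try:
    json.loads(invalid)
    rejected = False
except json.JSONDecodeError:
    rejected = True
print(rejected)  # True
```

Either wire form gives the application the same string; only the literal backslash-x spelling inside the JSON text is an error.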

Brian Waldon (bcwaldon) wrote :

Sorry, I should have added that you need to use ensure_ascii=False in the json.dumps call. So I may be missing something here, but here's my thinking: if you get the header 'content-type: application/json; charset=utf-8' in an HTTP response, you must first decode the body as utf-8, then as json. What I really want to be able to do is see my terminal recognize the \xNN characters as utf-8 and present them to me properly. I accomplished this by hacking in a fake tag and using the ensure_ascii kwarg to json.dumps.
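
The effect of that keyword argument can be sketched as follows, assuming Python 3 (where every str is unicode) and a sample tag list rather than a real Glance image:

```python
import json

tags = {"tags": ["\u2122"]}  # sample data: one tag, the trademark sign

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(tags)
# With ensure_ascii=False the character is kept as-is in the output text.
literal = json.dumps(tags, ensure_ascii=False)

print(escaped)  # {"tags": ["\u2122"]}
print(literal)  # {"tags": ["™"]}

# Encoded for the wire, the literal form carries the UTF-8 bytes e2 84 a2,
# which a UTF-8 terminal can render directly.
body = literal.encode("utf-8")

# The client side: decode the body as UTF-8 first, then parse it as JSON.
assert json.loads(body.decode("utf-8")) == tags
```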

Changed in glance:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance (master)

Fix proposed to branch: master
Review: https://review.openstack.org/12528

OpenStack Infra (hudson-openstack) wrote : Fix merged to glance (master)

Reviewed: https://review.openstack.org/12528
Committed: http://github.com/openstack/glance/commit/c008cef084c6362277e88569831ac09d81837327
Submitter: Jenkins
Branch: master

commit c008cef084c6362277e88569831ac09d81837327
Author: Eddie Sheffield <email address hidden>
Date: Thu Sep 6 17:13:21 2012 -0400

    Return actual unicode instead of escape sequences in v2.

    Ensured that when images are serialized to json unicode characters
    are preserved as-is rather than being translated to ASCII escape
    sequences.

    Fixes bug 1045455

    Change-Id: Ica6dc222bb8c8049cba7049720442d4c5bbb7d32
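
For readers without access to the review, the following is a rough sketch of the kind of change the commit message describes. It is not the actual Glance patch: serialize_image is a hypothetical helper, and Python 3 is assumed.

```python
import json


def serialize_image(image):
    """Hypothetical helper: build (headers, body) for an image dict."""
    # Keep non-ASCII characters as-is instead of emitting \uXXXX escapes.
    text = json.dumps(image, ensure_ascii=False)
    # The charset declared below must match the encoding actually used.
    body = text.encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=UTF-8",
        # Length is counted in bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = serialize_image({"tags": ["\u2122"]})
print(headers["Content-Length"])  # 17
```

One detail to watch with this approach is that Content-Length has to come from the encoded bytes: the sample text is 15 characters long but 17 bytes in UTF-8.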

Changed in glance:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in glance:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in glance:
milestone: folsom-rc1 → 2012.2