JSON is being stored in the database using text (unicode) columns

Bug #845904 reported by Gavin Panella
This bug affects 1 person
Affects: Launchpad itself
Status: Invalid
Importance: Low
Assigned to: Gavin Panella

Bug Description

Now that there is a JSON property class in Storm (available in Launchpad), we could improve things by switching to it, using BYTEA columns.
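
For illustration, a minimal sketch of using that property class (the
table and column names here are hypothetical, not Launchpad's schema):

  from storm.locals import Int, Storm
  from storm.properties import JSON

  class JobRecord(Storm):
      # Hypothetical table: Storm's JSON property serialises the
      # assigned Python structure to JSON for the database column.
      __storm_table__ = 'job_record'
      id = Int(primary=True)
      metadata = JSON()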

Revision history for this message
Robert Collins (lifeless) wrote :

That would be wrong - JSON is a text format (see http://www.ietf.org/rfc/rfc4627.txt section 3).
"3. Encoding

   JSON text SHALL be encoded in Unicode. The default encoding is
   UTF-8."

So BYTEA is just bonkers. We shouldn't use that and should help storm get fixed.

Changed in launchpad:
status: In Progress → Invalid
Revision history for this message
Gavin Panella (allenap) wrote :

Ah yes, this is because JSONVariable uses simplejson.dumps() with the default argument of ensure_ascii=True. I think this causes Storm, or something between Storm and PostgreSQL, to treat the value as a byte string (which makes sense), and then PostgreSQL rightly doesn't attempt implicit conversion between TEXT and BYTEA. Okay, I'll file a bug against Storm to rectify this.

Revision history for this message
Gavin Panella (allenap) wrote :

Bug 846867 in Storm.

Revision history for this message
Gavin Panella (allenap) wrote :

Actually, I don't think BYTEA is bonkers. The RFC says JSON "shall be encoded... the default encoding is UTF-8", so storing it as an encoded byte string is not crazy.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 845904] Re: JSON is being stored in the database using text (unicode) columns

On Sun, Sep 11, 2011 at 11:01 PM, Gavin Panella
<email address hidden> wrote:
> Actually, I don't think BYTEA is bonkers. The RFC says JSON "shall be
> encoded... the default encoding is UTF-8", so storing it as an encoded

It's no more or less crazy than manually encoding all our unicode
strings to bytea: the point is that JSON's *native* serialisation is
unicode, not bytes. Forcing everyone to do:
column = json.dumps(...).encode('utf8')

is quite some friction.

Revision history for this message
Gavin Panella (allenap) wrote :

> It's no more or less crazy than manually encoding all our unicode
> strings to bytea: the point is that JSON's *native* serialisation
> is unicode, not bytes.

The RFC is confusing me then. Some of what follows is me talking to
myself:

> JSON text SHALL be encoded in Unicode.

Perhaps I'm getting fixated on bytes, but I don't really know how you
encode something in Unicode, unless this is just saying things like
"A" is 65.

> The default encoding is UTF-8.

Okay, so JSON should natively be a byte string, encoded using UTF-8,
but that's talking about what it should look like over the wire or in
a file. But PostgreSQL gives us a choice between text and bytes, so I
think it's a matter of preference which one to use.

Back to what you said:

> Forcing everyone to do:
> column = json.dumps(...).encode('utf8')
>
> is quite some friction.

Without any extra arguments, json.dumps() and simplejson.dumps()
return a str. Non-ASCII characters are given JSON escapes:

  >>> import json
  >>> import simplejson
  >>> json.dumps(u"\u1234")
  '"\\u1234"'
  >>> simplejson.dumps(u"\u1234")
  '"\\u1234"'

Passing in ensure_ascii=False returns a unicode object with Unicode
code points preserved:

  >>> json.dumps(u"\u1234", ensure_ascii=False)
  u'"\u1234"'

(While on this topic, a weirdness I discovered is that
json.dumps(ensure_ascii=False) still returns a str when the JSON
contains only ASCII. Grr.)
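
For example (Python 2, standard lib json):

  >>> json.dumps(u"abc", ensure_ascii=False)
  '"abc"'
  >>> json.dumps(u"\u1234", ensure_ascii=False)
  u'"\u1234"'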

Things are being mixed up in serializeMetadata, for example, which
prepares a bunch of data for assigning to a TEXT column (via Storm's
Unicode property):

  return simplejson.dumps(metadata_dict).decode('utf-8')

This creates an ASCII-only string with embedded JSON \u escapes, but
the non-ASCII characters would be more efficiently represented by a
character encoding, i.e. UTF-8 or whatever the database chooses to use:

  return simplejson.dumps(metadata_dict, ensure_ascii=False)
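
To make the difference concrete, with some hypothetical metadata
(Python 2):

  >>> simplejson.dumps({'name': u'\u1234'}).decode('utf-8')
  u'{"name": "\\u1234"}'
  >>> simplejson.dumps({'name': u'\u1234'}, ensure_ascii=False)
  u'{"name": "\u1234"}'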

In fact, even if we were to store the JSON in a BYTEA column we would
still want to pass ensure_ascii=False so that the character encoding
takes care of the non-ASCII parts (again, I'm assuming that a sensible
character encoding would be more efficient in both storage and CPU than
using JSON \u escape encoding):

  return simplejson.dumps(
      metadata_dict, ensure_ascii=False).encode("utf-8")

tl;dr is:

- Stick with the Unicode/TEXT columns,

- Use ensure_ascii=False with dumps(),

and if the standard lib's json module is used:

- Check the return type of dumps(), to ensure it's unicode (see the sketch below).
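
A minimal sketch of such a check, assuming the standard lib's json
module on Python 2 (dumps_unicode is just an illustrative name):

  import json

  def dumps_unicode(obj):
      # Serialise obj to JSON, always returning unicode. On Python 2,
      # json.dumps() returns a str when the result happens to be
      # ASCII-only, even with ensure_ascii=False, so normalise that
      # case here.
      text = json.dumps(obj, ensure_ascii=False)
      if isinstance(text, str):
          text = text.decode('ascii')
      return text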

Revision history for this message
Robert Collins (lifeless) wrote :

> - Stick with the Unicode/TEXT columns,
>
> - Use ensure_ascii=False with dumps(),
>
> and if the standard lib's json module is used:
>
> - Check the return type of dumps(), to ensure it's unicode.

+1

Note that this is awkward because of Python 2 :). I sincerely hope
Python 3's json module is more consistent.
