JSON is being stored in the database using text (unicode) columns

Bug #845904 reported by Gavin Panella
This bug affects 1 person
Affects: Launchpad itself
Status: Invalid
Importance: Low
Assigned to: Gavin Panella

Bug Description

Now that there is a JSON property class in Storm (available in Launchpad), we could improve things by switching to it, using BYTEA columns.
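
For illustration, a minimal sketch of using that property class (the
table and column names here are hypothetical, not Launchpad's schema):

  from storm.locals import Int, Storm
  from storm.properties import JSON

  class JobRecord(Storm):
      # Hypothetical table: Storm's JSON property serialises the
      # assigned Python structure to JSON for the database column.
      __storm_table__ = 'job_record'
      id = Int(primary=True)
      metadata = JSON()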

Revision history for this message
Robert Collins (lifeless) wrote :

That would be wrong - JSON is a text format (see http://www.ietf.org/rfc/rfc4627.txt section 3).
"3. Encoding

   JSON text SHALL be encoded in Unicode. The default encoding is
   UTF-8."

So BYTEA is just bonkers. We shouldn't use that and should help storm get fixed.

Changed in launchpad:
status: In Progress → Invalid
Revision history for this message
Gavin Panella (allenap) wrote :

Ah yes, this is because JSONVariable uses simplejson.dumps() with the default argument of ensure_ascii=True. I think this causes Storm, or something between Storm and PostgreSQL, to treat the value as a byte string (which makes sense), and then PostgreSQL rightly doesn't attempt implicit conversion between TEXT and BYTEA. Okay, I'll file a bug against Storm to rectify this.

Revision history for this message
Gavin Panella (allenap) wrote :

Bug 846867 in Storm.

Revision history for this message
Gavin Panella (allenap) wrote :

Actually, I don't think BYTEA is bonkers. The RFC says JSON "shall be encoded... the default encoding is UTF-8", so storing it as an encoded byte string is not crazy.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 845904] Re: JSON is being stored in the database using text (unicode) columns

On Sun, Sep 11, 2011 at 11:01 PM, Gavin Panella
<email address hidden> wrote:
> Actually, I don't think BYTEA is bonkers. The RFC says JSON "shall be
> encoded... the default encoding is UTF-8", so storing it as an encoded

It's no more or less crazy than manually encoding all our unicode
strings to bytea: the point is that JSON's *native* serialisation is
unicode, not bytes. Forcing everyone to do:
column = json.dumps(...).encode('utf8')

is quite some friction.

Revision history for this message
Gavin Panella (allenap) wrote :

> It's no more or less crazy than manually encoding all our unicode
> strings to bytea: the point is that JSON's *native* serialisation
> is unicode, not bytes.

The RFC is confusing me then. Some of what follows is me talking to
myself:

> JSON text SHALL be encoded in Unicode.

Perhaps I'm getting fixated on bytes, but I don't really know how you
encode something in Unicode, unless this is just saying things like
"A" is 65.

> The default encoding is UTF-8.

Okay, so JSON should natively be a byte string, encoded using UTF-8,
but that's talking about what it should look like over the wire or in
a file. But PostgreSQL gives us a choice between text and bytes, so I
think it's a matter of preference which one to use.

Back to what you said:

> Forcing everyone to do:
> column = json.dumps(...).encode('utf8')
>
> is quite some friction.

Without any extra arguments, json.dumps() and simplejson.dumps()
return a str. Non-ASCII characters are given JSON escapes:

  >>> import json
  >>> import simplejson
  >>> json.dumps(u"\u1234")
  '"\\u1234"'
  >>> simplejson.dumps(u"\u1234")
  '"\\u1234"'

Passing in ensure_ascii=False returns a unicode object with Unicode
code points preserved:

  >>> json.dumps(u"\u1234", ensure_ascii=False)
  u'"\u1234"'

(While on this topic, a weirdness I discovered is that
json.dumps(ensure_ascii=False) still returns a str when the JSON
contains only ASCII. Grr.)
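
For example (Python 2, standard lib json):

  >>> json.dumps(u"abc", ensure_ascii=False)
  '"abc"'
  >>> json.dumps(u"\u1234", ensure_ascii=False)
  u'"\u1234"'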

Things are being mixed up in serializeMetadata, for example, which
prepares a bunch of data for assigning to a TEXT column (via Storm's
Unicode property):

  return simplejson.dumps(metadata_dict).decode('utf-8')

This creates an ASCII-only string with embedded JSON \u escapes, but
the non-ASCII characters would be more efficiently represented by a
character encoding, i.e. UTF-8 or whatever the database chooses to use:

  return simplejson.dumps(metadata_dict, ensure_ascii=False)
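
To make the difference concrete, with some hypothetical metadata
(Python 2):

  >>> simplejson.dumps({'name': u'\u1234'}).decode('utf-8')
  u'{"name": "\\u1234"}'
  >>> simplejson.dumps({'name': u'\u1234'}, ensure_ascii=False)
  u'{"name": "\u1234"}'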

In fact, even if we were to store the JSON in a BYTEA column we would
still want to pass ensure_ascii=False so that the character encoding
takes care of the non-ASCII parts (again, I'm assuming that a sensible
character encoding would be more efficient in both storage and CPU than
using JSON \u escape encoding):

  return simplejson.dumps(
      metadata_dict, ensure_ascii=False).encode("utf-8")

tl;dr is:

- Stick with the Unicode/TEXT columns,

- Use ensure_ascii=False with dumps(),

and if the standard lib's json module is used:

- Check the return type of dumps(), to ensure it's unicode (see the sketch below).
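
A minimal sketch of such a check, assuming the standard lib's json
module on Python 2 (dumps_unicode is just an illustrative name):

  import json

  def dumps_unicode(obj):
      # Serialise obj to JSON, always returning unicode. On Python 2,
      # json.dumps() returns a str when the result happens to be
      # ASCII-only, even with ensure_ascii=False, so normalise that
      # case here.
      text = json.dumps(obj, ensure_ascii=False)
      if isinstance(text, str):
          text = text.decode('ascii')
      return text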

Revision history for this message
Robert Collins (lifeless) wrote :

> - Stick with the Unicode/TEXT columns,
>
> - Use ensure_ascii=False with dumps(),
>
> and if the standard lib's json module is used:
>
> - Check the return type of dumps(), to ensure it's unicode.

+1

Note that this is awkward because of Python 2 :). I sincerely hope
Python 3's json module is more consistent.
