JSON loader fails with 'Unpaired high surrogate'

Bug #876810 reported by Bob Copeland
This bug affects 3 people
Affects: Meliae
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

When loading the attached file, I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/dist-packages/meliae/loader.py", line 549, in load
    manager = _load(source, using_json, show_prog, input_size)
  File "/usr/lib/python2.6/dist-packages/meliae/loader.py", line 622, in _load
    factory=objs.add):
  File "/usr/lib/python2.6/dist-packages/meliae/loader.py", line 603, in iter_objs
    yield decoder(factory, line, temp_cache=temp_cache)
  File "/usr/lib/python2.6/dist-packages/meliae/loader.py", line 64, in _from_json
    val = simplejson.loads(line)
  File "/usr/lib/pymodules/python2.6/simplejson/__init__.py", line 384, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/pymodules/python2.6/simplejson/decoder.py", line 402, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/pymodules/python2.6/simplejson/decoder.py", line 418, in raw_decode
    obj, end = self.scan_once(s, idx)
simplejson.decoder.JSONDecodeError: Unpaired high surrogate: line 1 column 85 (char 85)

The 2 lines in the attached file are actual lines from my much bigger heap dump. The string is clearly bogus unicode data, so I'm not sure whether this is a problem with the writer when it serializes unicode strings, or a problem with the loader for not tolerating the escapes.

The following egregious hack gets me past, but a proper fix would be welcome:

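import re  # needed by the fallback below

def _from_json(cls, line, temp_cache=None):
    try:
        val = simplejson.loads(line)
    except ValueError:  # simplejson.JSONDecodeError subclasses ValueError
        print "Could not parse %s" % line
        # Retry with the contents of the "value" field replaced by a placeholder.
        val = simplejson.loads(re.sub(r'value": "[^"]*"', 'value": "x"', line))

    # etc ....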
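
A slightly more careful variant of this fallback, as a Python 3 sketch rather than a patch to meliae: the helper name `loads_with_fallback` and its `loads` parameter are hypothetical, and the standard `json` module stands in for simplejson. It differs from the hack only in that the pattern also tolerates escaped quotes inside the value string.

```python
import json
import re

# Matches the "value" member of a dump line, allowing backslash escapes
# (including \") inside the string.
_VALUE_FIELD = re.compile(r'"value": "(?:[^"\\]|\\.)*"')


def loads_with_fallback(line, loads=json.loads):
    """Parse one dump line; if that fails, retry with the value blanked out."""
    try:
        return loads(line)
    except ValueError:
        # json.JSONDecodeError and simplejson.JSONDecodeError both
        # subclass ValueError.
        cleaned = _VALUE_FIELD.sub('"value": "x"', line)
        return loads(cleaned)
```

The object's other fields survive; only the unparsable string contents are lost, which is usually acceptable when the goal is to account for memory rather than inspect values.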

Bob Copeland (copeland) wrote :

Seems this is the root cause: http://bugs.python.org/issue11489

e.g. this script fails:

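#!/usr/bin/python
import simplejson
import binascii

# Bytes ED A5 88 are a UTF-8-style encoding of the lone high surrogate
# U+D948; the remaining bytes are ordinary text.
origstr = binascii.unhexlify('EDA588C2A36C6C6F')
z = simplejson.dumps(origstr)
print z
# Loading the string that was just dumped raises JSONDecodeError.
y = simplejson.loads(z)
print y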
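
For readers on Python 3, the sketch below shows what those bytes decode to and why the dumped text contains an unpaired surrogate escape. It is an illustration, not part of the original report: the helper `lone_surrogates` is hypothetical, and the standard `json` module is used in place of simplejson, so the failure itself is not reproduced here.

```python
import json

raw = bytes.fromhex("EDA588C2A36C6C6F")

# Strict UTF-8 rejects the sequence ED A5 88, so the "surrogatepass"
# error handler is needed to see the code point it stands for.
text = raw.decode("utf-8", errors="surrogatepass")


def lone_surrogates(s):
    """Return the indexes of code points in the range U+D800..U+DFFF."""
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]


# With the default ensure_ascii=True, the surrogate is written out as a
# \uXXXX escape, which is what a strict decoder later rejects.
dumped = json.dumps(text)
print(lone_surrogates(text), dumped)
```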

Ilya Murav'jov (muravjov-il) wrote :

I just want to make it clear: the simplejson maintainers state that JSON may not contain certain binary data, even when it is serialized in \uXXXX form (because JSON is a Unicode text format, and e.g. \ud800 is a lone surrogate).

For now, I convert all \udXXX escapes to a neutral #SdXXX form before analyzing the dumps, like so:

import re

# A \udXXX escape that is not itself preceded by a backslash.
surrogate = re.compile(r"(?<!\\)\\u([dD][0-9a-fA-F]{3})")
def replace_surrogates(sample):
    return surrogate.sub(r"#S\g<1>", sample)

Thus it seems that strict JSON is not a good fit for saving meliae dumps.
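
One possible refinement of that substitution, offered as a sketch under stated assumptions rather than a tested fix for meliae. The pattern `\udXXX` also covers U+D000..U+D7FF, which are ordinary characters (mostly Hangul), so the version below restricts the match to the real surrogate range U+D800..U+DFFF. It also counts preceding backslashes in pairs, so that an escaped backslash followed by a surrogate escape is still rewritten. The name `SURROGATE_ESCAPE` is hypothetical; the `#S` marker is kept from the comment above.

```python
import re

# Group 1: any run of escaped backslashes (\\) directly before the escape.
# Group 2: the four hex digits of a \uXXXX escape in D800..DFFF.
SURROGATE_ESCAPE = re.compile(
    r"(?<!\\)((?:\\\\)*)\\u([dD][89a-fA-F][0-9a-fA-F]{2})"
)


def replace_surrogates(sample):
    """Rewrite surrogate escapes in raw JSON text as #SXXXX markers."""
    return SURROGATE_ESCAPE.sub(r"\g<1>#S\g<2>", sample)
```

Like the original, this rewrites both halves of a valid surrogate pair as well, so characters outside the Basic Multilingual Plane end up as two markers instead of one character.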
