JSON parser doesn't recognize UTF-16 surrogate pairs

Bug #1024448 reported by Dennis Knochenwefel
This bug affects 1 person
Affects: Zorba
Status: Fix Released
Importance: High
Assigned to: Paul J. Lucas
Milestone: (none listed)

Bug Description

The JSON parser doesn't recognize UTF-16 surrogate pairs. For example, the escape sequence "\ud83d\udc4a" is currently converted to two separate Unicode code points, when it ought to be recognized as a UTF-16 surrogate pair and yield the single Unicode code point U+1F44A.
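
For reference, the standard UTF-16 arithmetic for combining a high and a low surrogate can be sketched in Python as follows. The function name `combine_surrogates` is made up for illustration; this is not Zorba's code, only the calculation the parser would need to perform.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD83D, 0xDC4A)))  # 0x1f44a
```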

Changed in zorba:
importance: Undecided → High
Paul J. Lucas (paul-lucas) wrote :

If there are 3 problems, there should be 3 different bugs, not all lumped together into a single bug. I have nothing to do with either html:parse() or tidy.

As for the 3rd bug, you don't say what the error is that it should report. What exactly is wrong with JSON parsing?

Changed in zorba:
status: New → Incomplete
description: updated
Dennis Knochenwefel (dennis-knochenwefel) wrote :

I'm not an encoding expert, so anything I say may potentially be wrong.

The string "\ud83d\udc4a" is an example containing a single JavaScript-escaped special character (cf. http://www.charbase.com/1f44a-unicode-fisted-hand-sign ). This is very common in JSON data, since JavaScript engines seem to use UTF-16 or UCS-2 internally.

I believe that the JSON parser attempts to parse "\ud83d\udc4a" as two single UTF-8 characters. As a result, it returns a string containing invalid code points. This can be reproduced with the following query:

  import module namespace json = "http://www.zorba-xquery.com/modules/converters/json";
  declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
  json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text()

returns:

  dynamic error [err:FOCH0001]: "55357":
  invalid code point; raised at runtime\zorba\src\api\serialization\serializer.cpp:204

Would it be possible for the JSON parser to detect UTF-16 encoded characters and convert them into valid UTF-8 characters?
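
A rough sketch of what such a conversion could look like, written in Python purely for illustration. The helper name `unescape_unicode` is hypothetical, and the sketch is deliberately simplified: it handles only `\uXXXX` escapes and ignores every other JSON escape (including `\\`), so it is not a complete JSON string decoder and not what Zorba does internally.

```python
import re

# Matches one JSON \uXXXX escape at a given position.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text: str) -> str:
    """Decode \\uXXXX escapes, joining UTF-16 surrogate pairs into one character."""
    out = []
    pos = 0
    while pos < len(text):
        m = _ESCAPE.match(text, pos)
        if not m:
            out.append(text[pos])  # ordinary character, copy through
            pos += 1
            continue
        unit = int(m.group(1), 16)
        pos = m.end()
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next escape must be a low surrogate.
            m2 = _ESCAPE.match(text, pos)
            low = int(m2.group(1), 16) if m2 else None
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            pos = m2.end()
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out.append(chr(unit))
    return "".join(out)


decoded = unescape_unicode(r"Let's get it. \ud83d\udc4a")
print(decoded.encode("utf-8")[-4:])  # b'\xf0\x9f\x91\x8a'
```

With pairing in place, the two escapes become the single code point U+1F44A, which serializes to four valid UTF-8 bytes instead of raising an invalid code point error for 55357 (0xD83D).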

Matthias Brantner (matthias-brantner) wrote :

I'm not sure I understand.

1. The default for JSON strings seems to be UTF-8.
2. If a JSON string uses an encoding other than UTF-8, the entire string should be transcoded. This needs to be done when the data is retrieved, for example by passing an encoding parameter to file:read-text.

Paul J. Lucas (paul-lucas) wrote :

I believe what's going on is that byte sequences like \ud83d\udc4a are supposed to represent UTF-16 surrogate pairs. This is what Dennis suggests since 1F44A is the Unicode code point represented.

IMHO, this is a bizarre way to do things: use a UTF-8 byte sequence to encode UTF-16 surrogate pairs. The code-points represented by the surrogate pairs should just be encoded in UTF-8 directly.

That said, I believe it's probably possible to handle this bizarre case and "do the right thing."
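
As a point of comparison only, Python's standard `json` module treats the escaped pair and the directly encoded character as the same string. This says nothing about how Zorba implements the fix; it only illustrates the behavior being asked for.

```python
import json

# The surrogate pair written as two \u escapes inside a JSON string.
escaped = json.loads('"\\ud83d\\udc4a"')
# The same character written directly in the JSON text.
direct = json.loads('"\U0001F44A"')

print(escaped == direct, hex(ord(escaped)))  # True 0x1f44a
```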

summary: - data-converter module problems with non utf-8 characters
+ JSON parser doesn't recognize UTF-16 surrogate pairs
description: updated
Changed in zorba:
status: Incomplete → In Progress
Changed in zorba:
status: In Progress → Fix Committed
Changed in zorba:
status: Fix Committed → Fix Released