Zorba

JSON parser doesn't recognize UTF-16 surrogate pairs

Bug #1024448 reported by Dennis Knochenwefel on 2012-07-13

This bug affects 1 person

Affects   Status         Importance   Assigned to     Milestone
Zorba     Fix Released   High         Paul J. Lucas   Zorba 2.7 "Gaia"

Bug Description

The JSON parser doesn't recognize UTF-16 surrogate pairs. For example, the escape sequence "\ud83d\udc4a" is currently converted to two separate Unicode code points, when the parser ought to recognize it as a UTF-16 surrogate pair and produce the single Unicode code point U+1F44A.
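
The arithmetic behind that expectation can be sketched as follows. This is only an illustration of the standard UTF-16 surrogate-pair formula, not Zorba code; the function name `combine_surrogates` is made up for this example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 payload bits; together they encode
    # (code point - 0x10000).
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDC4A)
print(f"U+{code_point:X}")  # prints "U+1F44A"
```

The two escapes therefore stand for one code point outside the Basic Multilingual Plane, which takes four bytes in UTF-8.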

Related branches

lp:~paul-lucas/zorba/bug-1024448

Merged into lp:zorba at revision 10944

Dennis Knochenwefel: Approve on 2012-07-17

Paul J. Lucas: Approve on 2012-07-17

Dennis Knochenwefel (dennis-knochenwefel) on 2012-07-13

Changed in zorba:
importance:	Undecided → High

Paul J. Lucas (paul-lucas) wrote on 2012-07-13:

If there are 3 problems, there should be 3 different bugs, not all lumped together into a single bug. I have nothing to do with either html:parse() or tidy.

As for the 3rd bug, you don't say what the error is that it should report. What exactly is wrong with JSON parsing?

Changed in zorba:
status:	New → Incomplete

Dennis Knochenwefel (dennis-knochenwefel) on 2012-07-16

description:	updated

Dennis Knochenwefel (dennis-knochenwefel) wrote on 2012-07-16:

I'm not an encoding expert, so anything I say may potentially be wrong.

The string "\ud83d\udc4a" is an example containing a single JavaScript-escaped special character (cf. http://www.charbase.com/1f44a-unicode-fisted-hand-sign ). This is very common in JSON data, as JavaScript engines seem to use UTF-16 or UCS-2 internally.

I believe that the json parser attempts to parse "\ud83d\udc4a" as two single utf-8 characters. As a result, it returns a string containing invalid codepoints. This can be reproduced with the following query:

  import module namespace json = "http://www.zorba-xquery.com/modules/converters/json";
  declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
  json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text()

returns:

dynamic error [err:FOCH0001]: "55357":
invalid code point; raised at runtime\zorba\src\api\serialization\serializer.cpp:204

Would it be possible for the json parser to detect utf-16 encoded characters and convert them into valid utf-8 characters?
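
The number in the error message lines up with this reading: 55357 is 0xD83D in decimal, i.e. the first half of the escaped pair on its own. A small check, using Python purely as a calculator (the helper `is_surrogate` is a name invented for this example, and Python's UTF-8 encoder stands in for any strict serializer):

```python
reported = 55357  # value quoted in the FOCH0001 message


def is_surrogate(cp: int) -> bool:
    """True if cp lies in the range reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


print(hex(reported), is_surrogate(reported))

# A lone surrogate is not a valid character, so a strict UTF-8 encoder
# refuses to serialize it.
try:
    chr(reported).encode("utf-8")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False
print(lone_surrogate_encodable)
```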

Matthias Brantner (matthias-brantner) wrote on 2012-07-16:

I'm not sure I understand.

1. The default for JSON strings seems to be UTF-8.
2. If a JSON string uses an encoding other than UTF-8, the entire string should be transcoded. This needs to be done when the data is retrieved, for example by passing an encoding parameter to file:read-text.
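
For reference, the JSON text from the query in this report can be checked against that reasoning. The sketch below assumes the input is exactly the string passed to json:parse() above, where the escapes are literal backslash-u sequences; whether other affected inputs look the same is not known from this report.

```python
# The escapes are written with a doubled backslash so that Python keeps them
# as literal text, the way they appear in the JSON document itself.
json_text = '{"text":"Let\'s get it. \\ud83d\\udc4a"}'

# Every character is ASCII, so the bytes are already valid UTF-8.
raw = json_text.encode("ascii")
print(raw.decode("utf-8") == json_text)
```

Under that assumption, transcoding at read time would leave the text unchanged, and the surrogate values only appear once the \u escapes are interpreted.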

Paul J. Lucas (paul-lucas) wrote on 2012-07-16:

I believe what's going on is that byte sequences like \ud83d\udc4a are supposed to represent UTF-16 surrogate pairs. This is what Dennis suggests since 1F44A is the Unicode code point represented.

IMHO, this is a bizarre way to do things: use a UTF-8 byte sequence to encode UTF-16 surrogate pairs. The code-points represented by the surrogate pairs should just be encoded in UTF-8 directly.

That said, I believe it's probably possible to handle this bizarre case and "do the right thing."
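
One way such handling could look is sketched below. This is not the code merged for this bug (the report does not show it); it is a minimal illustration with a made-up function name, `unescape_json_string`, and it does no validation of malformed or truncated escapes. A high surrogate that is not followed by a low surrogate is passed through unchanged.

```python
import json


def unescape_json_string(body: str) -> str:
    """Decode the escapes in a JSON string body (the text between the quotes)."""
    simple = {'"': '"', "\\": "\\", "/": "/",
              "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t"}
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        kind = body[i + 1]
        if kind != "u":
            out.append(simple[kind])
            i += 2
            continue
        unit = int(body[i + 2:i + 6], 16)
        i += 6
        # A high surrogate immediately followed by an escaped low surrogate
        # is merged into a single code point.
        if 0xD800 <= unit <= 0xDBFF and body[i:i + 2] == "\\u":
            low = int(body[i + 2:i + 6], 16)
            if 0xDC00 <= low <= 0xDFFF:
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 6
        out.append(chr(unit))
    return "".join(out)


sample = r"Let's get it. \ud83d\udc4a"
decoded = unescape_json_string(sample)
print(f"U+{ord(decoded[-1]):X}")  # prints "U+1F44A"

# Cross-check against Python's own JSON decoder, which also merges pairs.
print(decoded == json.loads('"' + sample + '"'))
```

Scanning escape by escape, rather than pattern-matching on the text "\ud83d", keeps an escaped backslash followed by "ud83d" from being mistaken for a surrogate.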

summary:	- data-converter module problems with non utf-8 characters
	+ JSON parser doesn't recognize UTF-16 surrogate pairs
description:	updated

Paul J. Lucas (paul-lucas) on 2012-07-16

Changed in zorba:
status:	Incomplete → In Progress

Paul J. Lucas (paul-lucas) on 2012-07-17

Changed in zorba:
status:	In Progress → Fix Committed

Dana Florescu (dflorescu) on 2012-10-24

Changed in zorba:
status:	Fix Committed → Fix Released
