Comment 2 for bug 1783475

Andrew Johnson (anj) wrote :

This bug is a symptom of the way we currently parse both field(name, value) and info(name, value) statements in a .db file. It looks a little tricky to fix, but I am working on it and have just reached this code as part of my JSON5 changes. I will describe what is happening and how JSON5 changes things, then propose a way forward.

In the dbStatic parser's dbRecordField() and dbRecordInfo() routines, the <value> is now expected to be JSON encoded; the dbLex.l & dbYacc.c code has already made sure of that by the time these routines get called. Both routines look at the first character of <value>, and if it's a double-quote they strip that and the last character (which must also be a double-quote to pass the lexer). Whether or not quotes were stripped, the string is then filtered through dbTranslateEscape(), and the result is given to dbPutString() to actually set the field value.

The example in this bug description shows that the processing of <value> described above is incorrect. We should not be translating escaped characters that appear inside a JSON map since the yajl parser in dbConstLink.c will do the translation for us later on. However we do need to translate any escapes inside simple quoted string values because they don't get passed through yajl at all.

A simple fix is thus to only call dbTranslateEscape() when the value was a quoted string, and this change does solve the example shown in the bug. However, since we say these values are JSON, we should also accept unicode escaped characters of the form \u201c (\u followed by 4 hex digits), which dbTranslateEscape() doesn't understand.
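To make the simple fix concrete, here is a minimal sketch of the intended control flow. It is not the dbStatic code: prepare_value() and translate_escape() are hypothetical names, and translate_escape() is only a tiny stand-in for dbTranslateEscape() that knows \n and \t and drops the back-slash from anything else.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for dbTranslateEscape(): translates only \n and \t,
 * and copies any other escaped character without its back-slash. */
static size_t translate_escape(char *dst, const char *src, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (c == '\\' && i + 1 < len) {
            c = src[++i];
            if (c == 'n') c = '\n';
            else if (c == 't') c = '\t';
        }
        dst[n++] = c;
    }
    dst[n] = '\0';
    return n;
}

/* Prepare a JSON-encoded <value> for storage (hypothetical helper).
 * dst must have room for strlen(value) + 1 bytes.
 * A quoted string loses its quotes and has its escapes translated here;
 * anything else (e.g. a JSON map) is copied unchanged so that a later
 * JSON parser sees the escapes exactly as they were written. */
static size_t prepare_value(char *dst, const char *value)
{
    size_t len;

    assert(dst != NULL && value != NULL);
    len = strlen(value);
    if (len >= 2 && value[0] == '"' && value[len - 1] == '"')
        return translate_escape(dst, value + 1, len - 2);
    memcpy(dst, value, len + 1);
    return len;
}
```

With this arrangement a value like "a\tb" is stored with a real tab, while {const:"a\tb"} reaches the link parser untouched.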

Also, and nobody has actually reported this yet, before I introduced the JSON parser it was possible to use the C octal and hex escape formats \ooo and \xXX inside quoted string values, which dbTranslateEscape() does understand. However they don't work any more because the JSON lexer rejects them before they can reach the dbRecordField() routine. JSON only accepts a back-slash before a very specific set of characters so back-slash followed by an x or a digit in a string causes the parser to abort.

JSON5 however allows any character to be escaped; the back-slash will be dropped if the combination has no special meaning. In addition to the \u followed by 4 hex digits which JSON accepts, JSON5 also accepts \x followed by 2 hex digits (although I just discovered that my YAJL changes for JSON5 don't implement this so I've now got to go back and fix that too). JSON5 does not accept C's octal escape sequences \ooo at all, but I doubt if anyone will particularly miss them nowadays.

Our dbTranslateEscape() code doesn't implement quite the same rules as JSON5, although it's pretty close if we don't care about the unicode form. The differences are that we translate "\a" to the C alert character (BEL, 0x07) and "\0" through "\7" introduce an octal numeric escape of up to 3 digits, whereas JSON5 says "\a" should produce "a", "\0" produce a zero byte and "\1" through "\7" generate "1" through "7" respectively. Our "\x" parsing also looks very suspicious as it doesn't limit itself to just 2 hex digits as JSON5 requires.
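The differences listed above can be seen side by side in a small sketch. This is not the dbTranslateEscape() source; unescape() and the rules enum are hypothetical, and the two modes only model the behaviour as described in this comment (C-style: \a is BEL, octal up to 3 digits, \x takes every hex digit that follows; JSON5-style: \a is "a", \0 is a zero byte, \1 to \7 are plain digits, \x takes exactly 2 hex digits). Treating a \x with no hex digits as a plain "x" is an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum rules { C_RULES, JSON5_RULES };

/* Value of a hex digit; the caller has already checked isxdigit(). */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Translate escapes from src into dst (dst needs strlen(src) + 1 bytes).
 * Returns the output length, since the result may contain a zero byte. */
static size_t unescape(char *dst, const char *src, enum rules mode)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL);
    while (*src) {
        int c = (unsigned char)*src++;
        if (c != '\\' || *src == '\0') { dst[n++] = (char)c; continue; }
        c = (unsigned char)*src++;
        switch (c) {
        case 'n': c = '\n'; break;
        case 't': c = '\t'; break;
        case 'r': c = '\r'; break;
        case 'a': if (mode == C_RULES) c = '\a'; break;  /* JSON5: plain 'a' */
        case 'x': {
            /* JSON5 mode stops after 2 digits; C mode has no limit (-1). */
            int max = (mode == JSON5_RULES) ? 2 : -1, v = 0, digits = 0;
            while (isxdigit((unsigned char)*src) && digits != max) {
                v = (v * 16 + hexval((unsigned char)*src++)) & 0xff;
                digits++;
            }
            if (digits > 0) c = v;
            break;
        }
        default:
            if (c >= '0' && c <= '7') {
                if (mode == C_RULES) {          /* octal, up to 3 digits */
                    int v = c - '0', digits = 1;
                    while (digits < 3 && *src >= '0' && *src <= '7') {
                        v = v * 8 + (*src++ - '0');
                        digits++;
                    }
                    c = v & 0xff;
                } else if (c == '0') {
                    c = '\0';                   /* JSON5: zero byte */
                }                               /* JSON5: \1..\7 stay digits */
            }
            break;
        }
        dst[n++] = (char)c;
    }
    dst[n] = '\0';
    return n;
}
```

For example, the input \101 gives the single character "A" under the C-style rules but the three characters "101" under the JSON5-style rules, and \x41B gives "A" followed by "B" only when \x is limited to 2 hex digits.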

So to summarize: An incomplete but quick fix would be to move the dbTranslateEscape() call inside the code that strips the leading and trailing quotes from a simple string value. This solves some issues, but isn't complete. A better fix can come from my work adding JSON5, but we still have to decide if we care about unicode escapes; if we don't, moving the dbTranslateEscape() call (and fixing the \x parsing) is relatively easy. For fully compliant handling I would probably add another yajl parser to dbLexRoutines() and use that to translate the quoted string value.

If we picked the middle option, the differences between the two translators would show up here:
    record(stringin, "s1") {
        field(INP, {const:"string-with-escapes"})
        field(DESC, "string-with-escapes")
    }
The s1.VAL field gets translated by yajl following the JSON5 rules while parsing the const input link. The s1.DESC field would be translated by dbTranslateEscape().