const link support can't handle escaped characters

Bug #1783475 reported by mdavidsaver
Affects: EPICS Base (status tracked in 7.0)
  3.16: Won't Fix / Importance: Undecided / Assigned to: Unassigned
  7.0:  Fix Released / Importance: Low / Assigned to: Andrew Johnson

Bug Description

I think the following should work, but it doesn't. The culprit seems to be the call to dbTranslateEscape() in dbRecordField() and dbRecordInfo(). The escaped newline is handled correctly by dbLex, but the later dbJLinkParse() errors on the now unescaped newline.

> record(stringin, "test") {
> field(INP, {const:"multi\nline"})
> }

> dbJLinkInit: lexical error: invalid character inside string.
> {"const":"multi line"}
> (right here) ------^

Revision history for this message
Andrew Johnson (anj) wrote :

Marking this as won't fix in 3.16, still live for the 7.0 branch though.

Revision history for this message
Andrew Johnson (anj) wrote :

This bug is a symptom of the way that we currently parse both field(name, value) and info(name, value) statements in a .db file, which looks a little tricky to fix but I am working on it and have just reached it as part of my JSON5 changes. I will describe what is happening and how JSON5 changes things, then propose a way forward.

In the dbStatic parser's dbRecordField() and dbRecordInfo() routines, the <value> is now expected to be JSON encoded — the dbLex.l & dbYacc.c code have already made sure of that by the time these routines get called. The code in both routines looks at the first character of <value> and if it's a double-quote it strips that and the last character (which must also be a double-quote to pass the lexer). The string is then filtered through dbTranslateEscape(), and the result given to dbPutString() to actually set the field value.

The example in this bug description shows that the processing of <value> described above is incorrect. We should not be translating escaped characters that appear inside a JSON map since the yajl parser in dbConstLink.c will do the translation for us later on. However we do need to translate any escapes inside simple quoted string values because they don't get passed through yajl at all.

A simple fix is thus to only call dbTranslateEscape() when the value was a quoted string, and this change does solve the example shown in the bug. However, since we say these values are JSON we should accept the \uXXXX Unicode escape forms, which dbTranslateEscape() doesn't understand.

Also, and nobody has actually reported this yet, before I introduced the JSON parser it was possible to use the C octal and hex escape formats \ooo and \xXX inside quoted string values, which dbTranslateEscape() does understand. However they don't work any more because the JSON lexer rejects them before they can reach the dbRecordField() routine. JSON only accepts a back-slash before a very specific set of characters so back-slash followed by an x or a digit in a string causes the parser to abort.

JSON5 however allows any character to be escaped; the back-slash will be dropped if the combination has no special meaning. In addition to the \u followed by 4 hex digits which JSON accepts, JSON5 also accepts \x followed by 2 hex digits (although I just discovered that my YAJL changes for JSON5 don't implement this so I've now got to go back and fix that too). JSON5 does not accept C's octal escape sequences \ooo at all, but I doubt if anyone will particularly miss them nowadays.

Our dbTranslateEscape() code doesn't implement quite the same rules as JSON5 although it's pretty close if we don't care about the unicode form. The differences are that we translate "\a" to (whatever C says it should become) and "\0" through "\7" introduce an octal numeric escape of up to 3 digits, whereas JSON5 says "\a" should produce "a", "\0" produce a zero byte and "\1" through "\7" generate "1" through "7" respectively. Our "\x" parsing also looks very suspicious as it doesn't limit itself to just 2 hex digits as JSON5 requires.
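To make the contrast concrete, the JSON5 rule for a single-character escape can be sketched as below. This is an illustrative helper written for this comparison, not EPICS code; it shows the cases where JSON5 deliberately differs from dbTranslateEscape(): "\a" becomes 'a' (the back-slash is simply dropped), "\0" is a NUL byte rather than the start of an octal number, and "\1" through "\7" are the literal digits:

```c
/* Sketch of JSON5's single-character escape rules, for contrast
 * with dbTranslateEscape().  Illustrative only, not EPICS code. */
static int json5_escape(char c)
{
    switch (c) {
    case 'b': return '\b';
    case 'f': return '\f';
    case 'n': return '\n';
    case 'r': return '\r';
    case 't': return '\t';
    case 'v': return '\v';
    case '0': return '\0';  /* NUL, not the start of an octal escape */
    default:  return c;     /* anything else: back-slash is dropped,
                             * so "\a" -> 'a' and "\1" -> '1' */
    }
}
```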

So to summarize: An incomplete but quick fix would be to move the dbTranslateEscape() call to inside the code...


Revision history for this message
Dirk Zimoch (dirk.zimoch) wrote :

\x should definitely limit itself to 2 hex digits.

Is it worth the effort to pre-process the strings and translate e.g. \033 to \x1b and \a to \x07 so that the JSON parser understands them? The problem is that the string may become longer, so the replacement cannot be done in place.
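The pre-processing suggested here could look something like the sketch below; it is a hypothetical helper, not anything in EPICS Base. Because "\42" (2 escape characters after the back-slash at most, 1 here) becomes "\x22", the output can be longer than the input, which is why a fresh buffer is needed:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: rewrite C-style octal escapes such as "\033"
 * into the "\x1b" form that a JSON5 parser understands.  The result
 * may be longer than the input, so it goes into a new buffer. */
static char *octal_to_hex(const char *in)
{
    /* worst case: every "\o" (2 chars) becomes "\xXX" (4 chars) */
    char *out = malloc(2 * strlen(in) + 1);
    char *p = out;

    while (*in) {
        if (in[0] == '\\' && in[1] >= '0' && in[1] <= '7') {
            int val = 0, n = 0;
            in++;                           /* skip the back-slash */
            while (n < 3 && *in >= '0' && *in <= '7') {
                val = val * 8 + (*in++ - '0');
                n++;
            }
            p += sprintf(p, "\\x%02x", val & 0xff);
        } else {
            *p++ = *in++;
        }
    }
    *p = '\0';
    return out;
}
```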

Revision history for this message
Andrew Johnson (anj) wrote :

I have already fixed the \x handling of the epicsStrnRawFromEscaped() routine on my branch, and changed how \400 gets parsed (to " 0" if you want to know), adding new tests for both cases.

Converting escapes into something that the JSON5 parser understands wouldn't be too hard to do inside the dbLex.l code that returns a jsonSTRING to the parser. It already has to do a temporary memory allocation at that point anyway, but scanning every string (again) for those escapes would of course slow down the parser slightly. Do you have any examples of real databases that actually use \a or the octal escape notation?

Revision history for this message
Andrew Johnson (anj) wrote :

It will probably be easier to teach the yajl code to understand octal escapes if we want to allow them, and the result will be faster too. The ECMA-262 (JavaScript) standard seems to allow them, but only for compatibility purposes (the JSON5 spec doesn't really say either way). It still won't understand \a though.

Revision history for this message
Andrew Johnson (anj) wrote :

I've changed my mind about octal escapes.

The JSON5 spec says "\012" should decode as a <nil> byte followed by "12", whereas C and dbTranslateEscape() decode it as a line-feed character "\n". We probably don't want to get different values for the VAL and DESC fields as would happen here if we don't do something:

record(stringin, s1) {
  field(DESC, "\041") # "!" - dbTranslateEscape()
  field(INP, {const:"\041"}) # "" - yajl
}

I now think the best solution is to drop support for octal character escapes completely, and require strings to use \xXX hex escapes if they can't use literal characters directly. I suspect octal escapes are very rarely used now, and removing them would resolve this difference, leaving just \a and \uXXXX to be decoded differently.

I will implement and document this change on my json5 branch.

Revision history for this message
Andrew Johnson (anj) wrote :

PR for my json5 branch on GitHub at https://github.com/epics-base/epics-base/pull/86

Revision history for this message
Dirk Zimoch (dirk.zimoch) wrote :

I don't think that many people use \a (aka "bell") in a record field. I mean, what would one expect? That the computer rings the bell when the field is displayed? So I am fine with not supporting \a. (Or \e = 0x1b = ESC for that matter, which is a non-standard extension anyway.)
The same goes for octal. Who would voluntarily use octal?

What may be a bit surprising is that the JSON5 spec requires \x to be followed by exactly 2 hex digits, not by up to 2 digits, while C happily accepts "\xf".
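The strict two-digit rule can be sketched as follows. This is an illustrative parser written for this comment, not the yajl or EPICS implementation; it rejects exactly the inputs (such as "\xf") that C would accept:

```c
#include <ctype.h>

/* Sketch of the strict JSON5 rule: "\x" must be followed by exactly
 * two hex digits (C, by contrast, accepts any number of them).
 * Given the text after the "\x", returns the decoded byte value,
 * or -1 if the escape is malformed.  Illustrative only. */
static int parse_x_escape(const char *s)
{
    int val = 0;

    if (!isxdigit((unsigned char)s[0]) || !isxdigit((unsigned char)s[1]))
        return -1;          /* fewer than two hex digits: reject */

    for (int i = 0; i < 2; i++) {
        char c = s[i];
        int digit = isdigit((unsigned char)c)
                        ? c - '0'
                        : tolower((unsigned char)c) - 'a' + 10;
        val = val * 16 + digit;
    }
    return val;
}
```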

Revision history for this message
rivers (rivers) wrote :

> The same for octal. Who would volutarily use octal?

Many Modbus vendors specify register addresses in octal, so it can be very convenient to use octal addresses in the link fields for the EPICS modbus driver. For example, search for "octal" in this document:
https://epics-modbus.readthedocs.io/en/latest/

These octal values would occur in the INP and OUT fields of records. I don't know whether those would be impacted by this change.

Revision history for this message
Andrew Johnson (anj) wrote :

@Mark This won't prevent applications from parsing octal numbers in strings; it only affects cases where you want to include a non-ASCII character in a string field, something like this:

  field(FVST, "F\374nf") # "Fünf" using the ISO-8859-1 char-set

which would have to be spelled like this instead:

  field(FVST, "F\xFCnf")

@Dirk The JSON5 spec is a small subset of the ECMAScript 5.1 language spec, so they're just following what that standard said. Actually the epicsStrnRawFromEscaped() routine will still accept a single hex digit, but the dbLoadRecords() lexer will complain if it doesn't see two.

Revision history for this message
Andrew Johnson (anj) wrote :

The 7.0.5 release includes changes to support JSON-5, and the behavior is now:

woz$ cat ~/db/test.db
record(stringin, "json") {
  field(INP, {const:"multi\nline"})
}
woz$ softIoc -d ~/db/test.db
Starting iocInit
############################################################################
## EPICS R7.0.4.2-DEV
## Rev. R7.0.4.1-190-g6734918e6e188b1f846e-dirty
############################################################################
iocRun: All initialization complete
epics> dbgf json
DBF_STRING: "multi\nline"
epics> dbpr json 1
APST: On Change ASG : BKPT: 00 DESC:
DISA: 0 DISP: 0 DISS: NO_ALARM DISV: 1
DTYP: Soft Channel EVNT: FLNK: CONSTANT
INP : JSON_LINK {const:"multi\nline"} MPST: On Change NAME: json
PACT: 0 PHAS: 0 PINI: NO PRIO: LOW
PUTF: 0 RPRO: 0 SCAN: Passive SDIS: CONSTANT
SEVR: INVALID SIML: CONSTANT SIMM: NO SIOL: CONSTANT
SSCN: <nil> STAT: UDF SVAL: TPRO: 0
TSE : 0 TSEL: CONSTANT UDF : 0 UDFS: INVALID
VAL : multi
line

Has this fixed the bug, or just changed it? Note that dbgf is escaping the string before printing it, whereas dbpr just prints it verbatim.
