Bug #1940283 “Misleading Error: Invalid character '"'” : Bugs : EPICS Base

Revision history for this message

Dirk Zimoch (dirk.zimoch) wrote on 2021-08-17:

#1

There is a tab, not a space between a and b in the examples above.

Revision history for this message

Dirk Zimoch (dirk.zimoch) wrote on 2021-08-17:

#2

Tabs at the beginning or end of the string give even stranger messages:

info(TAB, "a ")
info(TAB, " b")
Error: Invalid character '"'
at or before """ in path "." file "test.db" line 3
Error: Newline in string, closing quote missing
Error: syntax error

info(TAB, " ")
Error: Invalid character '"'
at or before """ in path "." file "test.db" line 3
Error: Invalid character '"'
Error: Invalid character ')'
Error: syntax error

Revision history for this message

Dirk Zimoch (dirk.zimoch) wrote on 2021-08-17:

#3

With other "funny" characters, the result is somewhat different.
For example the "bell" char 0x07.

Error: Invalid character '"'
at or before """ in path "." file "test.db" line 3
Error: Invalid character 0x07
Error: Invalid character '"'
Error: Invalid character ')'
Error: syntax error

At least 0x07 is mentioned, but the message is still misleading.

Revision history for this message

Andrew Johnson (anj) wrote on 2021-08-17:

#4

Why? Because dbLex.l has bugs in the JSON string lexing patterns. Congratulations, you're the first person to have noticed since JSON support was introduced to the database!

Revision history for this message

mdavidsaver (mdavidsaver) wrote on 2021-08-17:

#5

In dbLex.l, I guess the isprint() could change to something else? isprint() && !isspace(). Or maybe isgraph() if this is portable.

> if (isprint((int) yytext[0])) {
> sprintf(message, "Invalid character '%c'", yytext[0]);
> }
> else {
> sprintf(message, "Invalid character 0x%2.2x", yytext[0]);
> }

Offset tracking would be nice, but may be a more complex change.

Revision history for this message

Andrew Johnson (anj) wrote on 2021-08-17:

#6

I take that back, this isn't a bug so much as a request that we be more liberal than the JSON-5 spec permits. I might be open to that, but first we should make the error easier to understand. Here's what's happening:

The values of info() and field() entries in a database file are supposed to be valid JSON-5 values, and are currently parsed as such.

Looking at the JSON spec https://datatracker.ietf.org/doc/html/rfc7159 and the diagrams on https://www.json.org/json-en.html that spec doesn't actually allow *any* unescaped control characters inside string values. The spec says:

   A string begins and ends with
   quotation marks. All Unicode characters may be placed within the
   quotation marks, except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

Thus all control characters are supposed to be escaped according to the older JSON rules.

The JSON-5 spec at https://spec.json5.org/#strings has similar language, although its BNF also confusingly allows an unescaped SourceCharacter from the ECMAScript language spec at https://262.ecma-international.org/5.1/#sec-6 specification, which is defined as "any Unicode code unit". I think that can be ignored though.

Both of our JSON parser lexers (dbLex.l and yajl_lex.c) follow those strict specifications when it comes to the set of characters allowed inside strings. I could change them to allow unescaped tab characters inside strings (please comment if you have an opinion about that either way), but I will have to fix both parsers to do that.

I agree that the error messages you got aren't particularly helpful. The character being complained about is the initial double-quote at the start of the string – the lexer couldn't match the whole string because of the illegal character between the quotes, so it back-tracked to the very start of the it and complained about the quote itself. This also explains the "funny" result with the BEL character, which isn't currently legal anywhere in a .db file.

To give a more friendly error message here I can add error-matching patterns that recognize anything that looks like a string to a human but doesn't to the strict lexer, then tell the user what's wrong with their string. The first part should be relatively straightforward to code, although I'm not sure I want to write code that could analyze any kind of broken string and explain why it isn't a legal JSON string.

I take that back, this isn't a bug so much as a request that we be more liberal than the JSON-5 spec permits. I might be open to that, but first we should make the error easier to understand. Here's what's happening:

The values of info() and field() entries in a database file are supposed to be valid JSON-5 values, and are currently parsed as such.

Looking at the JSON spec https://datatracker.ietf.org/doc/html/rfc7159 and the diagrams on https://www.json.org/json-en.html that spec doesn't actually allow *any* unescaped control characters inside string values. The spec says:

A string begins and ends with
   quotation marks.  All Unicode characters may be placed within the
   quotation marks, except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

Thus all control characters are supposed to be escaped according to the older JSON rules.

The JSON-5 spec at https://spec.json5.org/#strings has similar language, although its BNF also confusingly allows an unescaped SourceCharacter from the ECMAScript language spec at https://262.ecma-international.org/5.1/#sec-6 specification, which is defined as "any Unicode code unit". I think that can be ignored though.

Both of our JSON parser lexers (dbLex.l and yajl_lex.c) follow those strict specifications when it comes to the set of characters allowed inside strings. I could change them to allow unescaped tab characters inside strings (please comment if you have an opinion about that either way), but I will have to fix both parsers to do that.

I agree that the error messages you got aren't particularly helpful. The character being complained about is the initial double-quote at the start of the string – the lexer couldn't match the whole string because of the illegal character between the quotes, so it back-tracked to the very start of the it and complained about the quote itself. This also explains the "funny" result with the BEL character, which isn't currently legal anywhere in a .db file.

To give a more friendly error message here I can add error-matching patterns that recognize anything that looks like a string to a human but doesn't to the strict lexer, then tell the user what's wrong with their string. The first part should be relatively straightforward to code, although I'm not sure I want to write code that could analyze any kind of broken string and explain why it isn't a legal JSON string.

Revision history for this message

Andrew Johnson (anj) wrote on 2021-08-17:

#7

This change

diff --git a/modules/database/src/ioc/dbStatic/dbLex.l b/modules/database/src/ioc/dbStatic/dbLex.l
index 33185133c..19812c9da 100644
--- a/modules/database/src/ioc/dbStatic/dbLex.l
+++ b/modules/database/src/ioc/dbStatic/dbLex.l
@@ -130,6 +130,14 @@ static int yyreset(void)

<INITIAL,JSON>{whitespace} ;

+<JSON>{doublequote}({stringchar}|{escape})*{doublequote} { /* bad JSON */
+ yyerrorAbort("Bad character in JSON string");
+}
+
+<JSON>{singlequote}({stringchar}|{escape})*{singlequote} { /* bad JSON */
+ yyerrorAbort("Bad character in JSON string");
+}
+
<INITIAL,JSON>. {
char message[40];
YY_BUFFER_STATE *dummy=0;

results in this error message:

woz$ softIoc -d ~/db/test.db
Error: Bad character in JSON string
at or before ""a b"" in file "/Users/anj/db/test.db" line 5
Error: Invalid character ')'
Error: syntax error
dbLoadRecords: failed to load '/Users/anj/db/test.db'
Error: Failed to load: /Users/anj/db/test.db

which is better, although not perfect because of the messages afterwards. Instead of just calling yyerrorAbort() it could take a look at the string (found in yytext) and explain what's wrong with it. Contributions welcomed...

Revision history for this message

Dirk Zimoch (dirk.zimoch) wrote on 2021-08-17:

#8

JSON also requires UTF-8 (which is not really supported anywhere else in EPICS, AFAIK). Thus using non-ASCII Latin-1 chars would be illegal. But it does not complain about Latin-1 encoded äöü which may have their place in DESC for example, or μ and ° in EGU.

Revision history for this message

Dirk Zimoch (dirk.zimoch) wrote on 2021-08-17:

#9

To clarify: I’m not complaining that the parser accepts Latin-1. Please keep it that way.
Rather our understanding of a string should not be limited by JSON.

Revision history for this message

Dirk Zimoch (dirk.zimoch) wrote on 2021-08-17:

#10

I like https://spec.json5.org/#parsers „A JSON5 parser may accept non-JSON5 forms or extensions“

Revision history for this message

Andrew Johnson (anj) wrote on 2021-08-17:

#11

The dbLex.l parser doesn't try to understand UTF-8 and will accept any characters from \x80-\xff in the string for the second parameter of a field() or info() entry. If you want us to keep that behavior you should generate a PR adding tests for that capability, with comments in the test giving the reasoning for them.

What happens to a string after it has been saved by dbPutString() or dbPutInfo() may be more challenging though. If a field is a JSON link it's parsed during iocInit() by the yajl parser, and that probably does require the string to be valid UTF-8. Similarly the QSRV code which parses info tags named "Q:group" uses the yajl parser. I think your update to the dbpf command calls the yajl parser so value strings there may have to be UTF-8, etc.

EPICS Base

Misleading Error: Invalid character '"'

Bug Description

Other bug subscribers

Remote bug watches