Comment 7 for bug 1935037

Dirk Zimoch (dirk.zimoch) wrote (last edit ):

In order to allow easy structure and array access while keeping the most flexibility in field names, I suggest defining a *small* set of disallowed chars:
* control codes and space (<= 0x20)
* structure and array access chars: . [ ]
* MAYBE a few other chars, like $ ( ) because of macros, and \ and " because they must be escaped in JSON.
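As a minimal sketch of what such a rule could look like (the function name and exact char set are my assumption, not an existing EPICS API):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical validity check for a single byte of a field name:
 * reject control codes and space (<= 0x20), the structure/array
 * access chars . [ ], and the chars that clash with macros ($, (, ))
 * or need JSON escaping (\, ").
 * All other bytes, including UTF-8 multibyte bytes (>= 0x80), pass. */
static bool is_allowed_in_field_name(unsigned char c)
{
    if (c <= 0x20)                  /* control codes and space */
        return false;
    if (strchr(".[]$()\\\"", c))    /* structure, macro and JSON chars */
        return false;
    return true;                    /* everything else, incl. UTF-8 */
}
```

Note that the check works byte-wise: since all UTF-8 continuation and lead bytes are >= 0x80, a small ASCII blacklist automatically lets every non-ASCII character through.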

I would generally allow all other Unicode (UTF-8) chars, in particular accented/modified Latin characters like äöüëßáóúíýàòùǹñøœåš, punctuation like {}'-+/*;:#% and Asian characters like 文本, ข้อความ. Since Unicode, there is no reason any more to force everyone in the world into the limited character set used for English. The JSON parser understands Unicode (that is part of the JSON standard).

Also, UTF-8 is not particularly difficult to use:
* Nothing changes for English (ASCII)
* A 0x00 byte never appears inside a multibyte sequence, only as the terminator at the end of a string (or an explicit \u0000), so functions like strlen() and strdup() work perfectly fine and allocate the right number of bytes.
* A multibyte character is easy to detect (high bit set), and traversing backwards through multibyte characters is easy too, so strings truncated byte-wise (e.g. by strncpy()) are easy to repair if a character was cut in half at the end.
* Calculating the screen width (number of characters) to adjust screen output is easy too. (Just don't use strlen() for that.)
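The last two points can be sketched in a few lines of C (function names are mine, just for illustration; the width function counts code points and ignores double-width and combining characters):

```c
#include <stddef.h>
#include <string.h>

/* In UTF-8, continuation bytes have the form 10xxxxxx (0x80..0xBF).
 * After a blind byte truncation (e.g. strncpy), back up over trailing
 * continuation bytes and drop an incomplete lead byte, so that no
 * character remains cut in half at the end. */
static void utf8_fix_truncation(char *s)
{
    size_t n = strlen(s);
    size_t i = n;
    while (i > 0 && ((unsigned char)s[i-1] & 0xC0) == 0x80)
        i--;                                    /* skip continuation bytes */
    if (i > 0 && ((unsigned char)s[i-1] & 0xC0) == 0xC0) {
        unsigned char lead = (unsigned char)s[i-1];
        size_t need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
        if (n - (i-1) < need)                   /* sequence incomplete? */
            s[i-1] = '\0';                      /* cut the partial char */
    }
}

/* Count characters (code points), not bytes: every byte that is not
 * a continuation byte starts a new character. */
static size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

For example, truncating "abä" ("ab\xC3\xA4") to 3 bytes leaves a dangling lead byte, which utf8_fix_truncation() removes, while the intact 4-byte string passes through unchanged.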

Instead of blocking the use of UTF-8 with new restrictions, we should rather aim to support it fully in EPICS DBs (record names, string values, macro names).