Steel Bank Common Lisp

Unassigned Unicode codepoints are reported as upper-case alphabetic characters with decimal value 0

Reported by Ken Harris on 2013-05-09
This bug affects 1 person
Affects: SBCL
Importance: Low
Assigned to: Christophe Rhodes

Bug Description

SBCL's character functions report unassigned Unicode codepoints as upper-case letters.

For example, U+0378 is unassigned (as of Unicode 6.2: http://www.unicode.org/charts/PDF/U0370.pdf). But in SBCL:

    * (alpha-char-p #\u0378)
    T ;; expected: NIL

Internal functions are also affected, which explains why these codepoints are reported as upper-case letters:

    * (sb-impl::ucd-general-category #\u0378)
    0 ;; "Lu" -- see *general-categories* in ucd.lisp -- expected: whatever index corresponds to "Cn" ("Unassigned")

Unassigned codepoints are also reported as having the decimal digit value 0:

    * (sb-impl::ucd-decimal-digit #\u0378)
    0 ;; expected: NIL
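
The symptoms above can also be probed through standard functions only, which may help when comparing SBCL versions or other implementations. This is a minimal sketch: PROBE-CODEPOINT is a hypothetical helper, not part of SBCL, and whether DIGIT-CHAR-P is affected in the same way as the internal sb-impl::ucd-decimal-digit is an assumption the report does not confirm.

```lisp
;; PROBE-CODEPOINT is a hypothetical helper for illustration only.
(defun probe-codepoint (code)
  "Return a plist of what the standard character predicates say about CODE."
  (let ((ch (code-char code)))          ; CODE-CHAR may return NIL
    (if (null ch)
        (list :char nil)
        (list :char ch
              :alpha (alpha-char-p ch)
              :upper (upper-case-p ch)
              :digit (digit-char-p ch)))))

;; U+0378 is the unassigned codepoint used in this report.
(probe-codepoint #x0378)
```

On a build without this bug, the :ALPHA, :UPPER and :DIGIT entries for an unassigned codepoint would all be expected to be NIL.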

One part of the problem could be that SBCL lacks the general category "Cn" ("Unassigned"):

    SBCL general categories: https://github.com/sbcl/sbcl/blob/master/tools-for-build/ucd.lisp#L169-L172
    Unicode general categories: http://www.unicode.org/reports/tr44/#Property_Values

I don't know that it's as easy as adding "Cn" to this list, though, because there's code in SBCL (like in target-char.lisp) that checks general category by index, like (< gc 5) or (= gc 12). Adding a new value here would change the indexes. (Maybe it's safe to add to the end?)
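
To illustrate the concern about index-based checks, here is a small sketch of how inserting a category in the middle of the list shifts later indexes, while appending does not. The list below is a hypothetical, abbreviated stand-in for *general-categories*; only "Lu" at index 0 is taken from the observation above, and CATEGORY-INDEX is an illustrative helper, not an SBCL function.

```lisp
;; Hypothetical, abbreviated category list (the real one is in ucd.lisp).
(defparameter *example-categories*
  '("Lu" "Ll" "Lt" "Lm" "Lo" "Cc" "Cf"))

(defun category-index (name categories)
  "Return the zero-based index of NAME in CATEGORIES, or NIL if absent."
  (position name categories :test #'string=))

;; Inserting "Cn" before "Cc" shifts every later index by one.
(defparameter *inserted*
  (append (subseq *example-categories* 0 5)
          '("Cn")
          (subseq *example-categories* 5)))

;; Appending "Cn" leaves all existing indexes unchanged.
(defparameter *appended*
  (append *example-categories* '("Cn")))

(category-index "Cc" *example-categories*) ; => 5
(category-index "Cc" *inserted*)           ; => 6
(category-index "Cc" *appended*)           ; => 5
(category-index "Cn" *appended*)           ; => 7
```

With the inserted variant, a hard-coded test such as (= gc 5) would silently start matching "Cn" instead of "Cc", which is why appending looks like the less invasive option.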

VERSION INFORMATION:

$ sbcl --version
SBCL 1.1.3

$ uname -a
Darwin Ken-Harris-no-Mac-Pro.local 11.4.0 Darwin Kernel Version 11.4.0: Mon Apr 9 19:32:15 PDT 2012; root:xnu-1699.26.8~1/RELEASE_X86_64 x86_64

* *features*
(:ALIEN-CALLBACKS :ANSI-CL :BSD :C-STACK-IS-CONTROL-STACK :COMMON-LISP
 :COMPARE-AND-SWAP-VOPS :COMPLEX-FLOAT-VOPS :CYCLE-COUNTER :DARWIN :DARWIN
 :DARWIN9-OR-BETTER :FLOAT-EQL-VOPS :GENCGC :IEEE-FLOATING-POINT
 :INLINE-CONSTANTS :INODE64 :LINKAGE-TABLE :LITTLE-ENDIAN
 :MACH-EXCEPTION-HANDLER :MACH-O :MEMORY-BARRIER-VOPS :MULTIPLY-HIGH-VOPS
 :OS-PROVIDES-BLKSIZE-T :OS-PROVIDES-DLADDR :OS-PROVIDES-DLOPEN
 :OS-PROVIDES-PUTWC :OS-PROVIDES-SUSECONDS-T :RAW-INSTANCE-INIT-VOPS :SB-DOC
 :SB-EVAL :SB-LDB :SB-PACKAGE-LOCKS :SB-SOURCE-LOCATIONS :SB-TEST :SB-THREAD
 :SB-UNICODE :SBCL :STACK-ALLOCATABLE-CLOSURES :STACK-ALLOCATABLE-FIXED-OBJECTS
 :STACK-ALLOCATABLE-LISTS :STACK-ALLOCATABLE-VECTORS
 :STACK-GROWS-DOWNWARD-NOT-UPWARD :UD2-BREAKPOINTS :UNIX
 :UNWIND-TO-FRAME-AND-CALL-VOP :X86-64)

Ken Harris <email address hidden> writes:

 status inprogress
 importance low
 assignee csr21-cantab
 done

> One part of the problem could be that SBCL lacks the general category
> "Cn" ("Unassigned"):
>
> SBCL general categories: https://github.com/sbcl/sbcl/blob/master/tools-for-build/ucd.lisp#L169-L172
> Unicode general categories: http://www.unicode.org/reports/tr44/#Property_Values
>
> I don't know that it's as easy as adding "Cn" to this list, though,
> because there's code in SBCL (like in target-char.lisp) that checks
> general category by index, like (< gc 5) or (= gc 12). Adding a new
> value here would change the indexes. (Maybe it's safe to add to the
> end?)

It is basically safe to add to the end, with some wrinkles in building
the tables in the first place. I have a fix for this, unfortunately
currently tangled up in the middle of all the rest of the Unicode tree
that I'm working on.

Changed in sbcl:
assignee: nobody → Christophe Rhodes (csr21-cantab)
importance: Undecided → Low
status: New → In Progress
Changed in sbcl:
status: In Progress → Fix Committed
information type: Public → Public Security
Stas Boukarev (stassats) on 2013-05-19
information type: Public Security → Public
Changed in sbcl:
status: Fix Committed → Fix Released