Unassigned Unicode codepoints are reported as upper-case alphabetic characters with decimal value 0

Bug #1178038 reported by Ken Harris
This bug affects 1 person
Affects: SBCL
Status: Fix Released
Importance: Low
Assigned to: Christophe Rhodes

Bug Description

SBCL's character functions report unassigned Unicode codepoints as upper-case letters.

For example, U+0378 is unassigned (as of Unicode 6.2: http://www.unicode.org/charts/PDF/U0370.pdf). But in SBCL:

    * (alpha-char-p #\u0378)
    T ;; expected: NIL

Internal functions are also affected. The general category comes back as 0, which is why these codepoints claim to be upper-case letters:

    * (sb-impl::ucd-general-category #\u0378)
    0 ;; "Lu" -- see *general-categories* in ucd.lisp -- expected: whatever index corresponds to "Cn" ("Unassigned")

They also claim to represent the decimal value 0:

    * (sb-impl::ucd-decimal-digit #\u0378)
    0 ;; expected: NIL
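
The expected results above can be collected into a small regression check. This is only a sketch using standard Common Lisp predicates; the function name `unassigned-looks-sane-p` is hypothetical and not part of SBCL, and the check assumes the implementation can represent the codepoint as a character at all.

```lisp
;; Hypothetical helper, not part of SBCL: returns true when the character
;; for CODE is neither alphabetic, upper-case, nor a decimal digit.
;; Codepoints the implementation cannot represent are treated as fine.
(defun unassigned-looks-sane-p (code)
  (let ((ch (and (< code char-code-limit) (code-char code))))
    (or (null ch)
        (and (not (alpha-char-p ch))
             (not (upper-case-p ch))
             (not (digit-char-p ch))))))

;; U+0378 is the unassigned codepoint used in this report.
(unassigned-looks-sane-p #x378)
```

On an affected build the final form should return NIL, since `alpha-char-p` wrongly answers T there.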

One part of the problem could be that SBCL lacks the general category "Cn" ("Unassigned"):

    SBCL general categories: https://github.com/sbcl/sbcl/blob/master/tools-for-build/ucd.lisp#L169-L172
    Unicode general categories: http://www.unicode.org/reports/tr44/#Property_Values

I don't know that it's as easy as adding "Cn" to this list, though, because there's code in SBCL (like in target-char.lisp) that checks general category by index, like (< gc 5) or (= gc 12). Adding a new value here would change the indexes. (Maybe it's safe to add to the end?)

VERSION INFORMATION:

$ sbcl --version
SBCL 1.1.3

$ uname -a
Darwin Ken-Harris-no-Mac-Pro.local 11.4.0 Darwin Kernel Version 11.4.0: Mon Apr 9 19:32:15 PDT 2012; root:xnu-1699.26.8~1/RELEASE_X86_64 x86_64

* *features*
(:ALIEN-CALLBACKS :ANSI-CL :BSD :C-STACK-IS-CONTROL-STACK :COMMON-LISP
 :COMPARE-AND-SWAP-VOPS :COMPLEX-FLOAT-VOPS :CYCLE-COUNTER :DARWIN :DARWIN
 :DARWIN9-OR-BETTER :FLOAT-EQL-VOPS :GENCGC :IEEE-FLOATING-POINT
 :INLINE-CONSTANTS :INODE64 :LINKAGE-TABLE :LITTLE-ENDIAN
 :MACH-EXCEPTION-HANDLER :MACH-O :MEMORY-BARRIER-VOPS :MULTIPLY-HIGH-VOPS
 :OS-PROVIDES-BLKSIZE-T :OS-PROVIDES-DLADDR :OS-PROVIDES-DLOPEN
 :OS-PROVIDES-PUTWC :OS-PROVIDES-SUSECONDS-T :RAW-INSTANCE-INIT-VOPS :SB-DOC
 :SB-EVAL :SB-LDB :SB-PACKAGE-LOCKS :SB-SOURCE-LOCATIONS :SB-TEST :SB-THREAD
 :SB-UNICODE :SBCL :STACK-ALLOCATABLE-CLOSURES :STACK-ALLOCATABLE-FIXED-OBJECTS
 :STACK-ALLOCATABLE-LISTS :STACK-ALLOCATABLE-VECTORS
 :STACK-GROWS-DOWNWARD-NOT-UPWARD :UD2-BREAKPOINTS :UNIX
 :UNWIND-TO-FRAME-AND-CALL-VOP :X86-64)

Christophe Rhodes (csr21-cantab) wrote : Re: [Bug 1178038] [NEW] Unassigned Unicode codepoints are reported as upper-case alphabetic characters with decimal value 0

Ken Harris <email address hidden> writes:

 status inprogress
 importance low
 assignee csr21-cantab
 done

> One part of the problem could be that SBCL lacks the general category
> "Cn" ("Unassigned"):
>
> SBCL general categories: https://github.com/sbcl/sbcl/blob/master/tools-for-build/ucd.lisp#L169-L172
> Unicode general categories: http://www.unicode.org/reports/tr44/#Property_Values
>
> I don't know that it's as easy as adding "Cn" to this list, though,
> because there's code in SBCL (like in target-char.lisp) that checks
> general category by index, like (< gc 5) or (= gc 12). Adding a new
> value here would change the indexes. (Maybe it's safe to add to the
> end?)

It is basically safe to add to the end, with some wrinkles in building
the tables in the first place. I have a fix for this, unfortunately
currently tangled up in the middle of all the rest of the Unicode tree
that I'm working on.
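
The point about appending can be illustrated with a toy table. This is a minimal sketch, assuming a shortened, made-up category list (not the real one in ucd.lisp); the names `*old-categories*`, `category-index`, and `indexes-preserved-p` are hypothetical. It does not model the table-building wrinkles mentioned above.

```lisp
;; Hypothetical, shortened category table; NOT the real list from ucd.lisp.
(defparameter *old-categories* '("Lu" "Ll" "Lt" "Lm" "Lo" "Nd"))

;; Appending "Cn" at the end, as suggested in the report.
(defparameter *new-categories* (append *old-categories* '("Cn")))

(defun category-index (name table)
  "Return the zero-based index of NAME in TABLE, or NIL if absent."
  (position name table :test #'string=))

(defun indexes-preserved-p (old new)
  "True if every category in OLD has the same index in NEW."
  (every (lambda (name)
           (eql (category-index name old)
                (category-index name new)))
         old))

;; Existing comparisons such as (< gc 5) keep their meaning only if this holds.
(indexes-preserved-p *old-categories* *new-categories*)
```

Inserting "Cn" anywhere other than the end would shift the later entries, and `indexes-preserved-p` would then return NIL for the same comparison.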

Changed in sbcl:
assignee: nobody → Christophe Rhodes (csr21-cantab)
importance: Undecided → Low
status: New → In Progress
Changed in sbcl:
status: In Progress → Fix Committed
information type: Public → Public Security
Stas Boukarev (stassats)
information type: Public Security → Public
Changed in sbcl:
status: Fix Committed → Fix Released