ALPHANUMERIC/DIGIT-CHAR-P invariant broken with Unicode

Bug #1177986 reported by Ken Harris on 2013-05-08
This bug affects 1 person
Affects: SBCL
Importance: Medium
Assigned to: Unassigned

Bug Description

DIGIT-CHAR-P returns a non-NIL value only for the 10 ASCII digits when radix <= 10, even though it recognizes non-ASCII Unicode digit characters in other contexts.

ALPHANUMERICP is defined in SBCL by checking the UCD-GENERAL-CATEGORY: <5, or =12.

ALPHA-CHAR-P is defined in SBCL by checking the UCD-GENERAL-CATEGORY: <5.

DIGIT-CHAR-P is a little different because it can also take an optional "radix" argument. Still, it should make sense that anything with UCD-GENERAL-CATEGORY =12 should match when radix=10 (the default).

The Hyperspec page for ALPHANUMERICP even makes this relationship explicit:

     (alphanumericp x)
       == (or (alpha-char-p x) (not (null (digit-char-p x))))

In SBCL 1.1.3 (with :SB-UNICODE in *FEATURES*), this isn't always the case. (1.1.3 isn't the latest release, but this function doesn't appear to have been updated since then.)

For example, consider #\FULLWIDTH_DIGIT_TWO (U+FF12). It's in category "Number, Decimal Digit [Nd]", so one might reasonably think it would pass DIGIT-CHAR-P. But it doesn't -- even though it's ALPHANUMERICP:

    * (digit-char-p #\FULLWIDTH_DIGIT_TWO)
    NIL

    * (alphanumericp #\FULLWIDTH_DIGIT_TWO)
    T

    * (or (alpha-char-p #\FULLWIDTH_DIGIT_TWO) (not (null (digit-char-p #\FULLWIDTH_DIGIT_TWO))))
    NIL

Internally, it looks like SBCL does recognize that it's a digit, with value 2:

    * (sb-impl::ucd-decimal-digit #\FULLWIDTH_DIGIT_TWO)
    2

It seems like DIGIT-CHAR-P's "Special-case decimal and smaller radices" is what's causing the problem. If you ask if this character is a digit in base-11, SBCL reports that it is:

    * (digit-char-p #\FULLWIDTH_DIGIT_TWO 11)
    2

I expect that any character that returns a value 0-9 from DIGIT-CHAR-P with radix=11 should also return that value when radix=10.
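
That expectation can be written down as a check. This is only a sketch: the function name RADIX-10-VS-11-MISMATCHES is made up for illustration and is not part of SBCL. It uses nothing beyond standard DIGIT-CHAR-P, CODE-CHAR and CHAR-CODE-LIMIT, and collects every character whose radix-11 weight is 0-9 but whose radix-10 result is something else.

```lisp
;; Hypothetical helper (not part of SBCL): find characters that are
;; digits 0-9 in radix 11 but do not have the same weight in radix 10.
(defun radix-10-vs-11-mismatches ()
  "Collect characters whose radix-11 weight is 0-9 but whose radix-10
result differs (including NIL)."
  (loop for code from 0 below char-code-limit
        for char = (code-char code)       ; may be NIL on some implementations
        for w11 = (and char (digit-char-p char 11))
        when (and w11
                  (< w11 10)
                  (not (eql w11 (digit-char-p char 10))))
          collect char))
```

If the two radices agreed, (radix-10-vs-11-mismatches) would return the empty list.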

SIMPLE TEST CASE:

This code returns a list of all characters which don't meet the Hyperspec's equivalence mentioned above:

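    ;; DEFPARAMETER instead of DEFCONSTANT: a freshly consed list is never
    ;; EQL to the old value, so re-evaluating a DEFCONSTANT can signal an error.
    (defparameter *all-chars*
      (loop for i from 0 below char-code-limit
            for c = (code-char i)
            when c collect c))
    ;; Compare both sides as plain booleans, since the predicates only
    ;; promise generalized booleans.
    (loop for x in *all-chars*
          unless (eq (not (null (alphanumericp x)))
                     (not (null (or (alpha-char-p x) (digit-char-p x)))))
          collect x)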

It should return the empty list, but returns 401 characters here.

VERSION INFORMATION:

$ sbcl --version
SBCL 1.1.3

$ uname -a
Darwin Ken-Harris-no-Mac-Pro.local 11.4.0 Darwin Kernel Version 11.4.0: Mon Apr 9 19:32:15 PDT 2012; root:xnu-1699.26.8~1/RELEASE_X86_64 x86_64

* *features*
(:ALIEN-CALLBACKS :ANSI-CL :BSD :C-STACK-IS-CONTROL-STACK :COMMON-LISP
 :COMPARE-AND-SWAP-VOPS :COMPLEX-FLOAT-VOPS :CYCLE-COUNTER :DARWIN :DARWIN
 :DARWIN9-OR-BETTER :FLOAT-EQL-VOPS :GENCGC :IEEE-FLOATING-POINT
 :INLINE-CONSTANTS :INODE64 :LINKAGE-TABLE :LITTLE-ENDIAN
 :MACH-EXCEPTION-HANDLER :MACH-O :MEMORY-BARRIER-VOPS :MULTIPLY-HIGH-VOPS
 :OS-PROVIDES-BLKSIZE-T :OS-PROVIDES-DLADDR :OS-PROVIDES-DLOPEN
 :OS-PROVIDES-PUTWC :OS-PROVIDES-SUSECONDS-T :RAW-INSTANCE-INIT-VOPS :SB-DOC
 :SB-EVAL :SB-LDB :SB-PACKAGE-LOCKS :SB-SOURCE-LOCATIONS :SB-TEST :SB-THREAD
 :SB-UNICODE :SBCL :STACK-ALLOCATABLE-CLOSURES :STACK-ALLOCATABLE-FIXED-OBJECTS
 :STACK-ALLOCATABLE-LISTS :STACK-ALLOCATABLE-VECTORS
 :STACK-GROWS-DOWNWARD-NOT-UPWARD :UD2-BREAKPOINTS :UNIX
 :UNWIND-TO-FRAME-AND-CALL-VOP :X86-64)

Ken Harris (kengruven+lp) wrote :

 summary "ALPHANUMERIC/DIGIT-CHAR-P invariant broken with Unicode"
 status triaged
 importance medium
 done

Ken Harris <email address hidden> writes:

> The Hyperspec page for ALPHANUMERICP even makes this relationship
> explicit:
>
> (alphanumericp x)
> == (or (alpha-char-p x) (not (null (digit-char-p x))))

I haven't thought this through properly, but I think that my preferred
resolution to this invariance breakage is actually to restrict
digit-char-p to the ascii set, rather than extending it to fullwidth
digit variants and similar. The reason I say that is that if you expect
(digit-char-p #\FULLWIDTH_DIGIT_TWO 11) to be 2 (and I agree that that's
reasonable, if not the only possible thing) you might also expect
(digit-char-p #\FULLWIDTH_LATIN_CAPITAL_LETTER_A 11) to be 10, which is
perhaps a little more surprising but still not impossible, because we
could just take compatibility decompositions of characters, right?
Except that then (digit-char-p #\FEMININE_ORDINAL_INDICATOR 11) would
also be 10, which is frankly not expected at all.

Of course, restricting digit-char-p to interpreting only ascii digits as
numbers is irritating to those who want to work with Unicode. But I
think the answer to that is to provide and export richer Unicode
functionality, so that users can legitimately work with the Unicode data
that we store. (In my own slow way I am working on this; my github fork
of sbcl has an update to Unicode 6.2 and the beginnings of
normalization, sadly not yet complete).

summary: - DIGIT-CHAR-P not correct for non-ASCII digit chars
         + ALPHANUMERIC/DIGIT-CHAR-P invariant broken with Unicode
Changed in sbcl:
  importance: Undecided → Medium
  status: New → Triaged

Ken Harris (kengruven+lp) wrote :

Hmm, I see your point. I'd like to suggest a third possible option.

The Unicode standard has another flag on each character: Hex_Digit. This includes characters like FULLWIDTH_DIGIT_TWO and FULLWIDTH_LATIN_CAPITAL_LETTER_A, but not FEMININE_ORDINAL_INDICATOR or SUPERSCRIPT_TWO.

[http://en.wikipedia.org/wiki/Unicode_numerals#Hexadecimal_digits]

I don't know the exact definition of this flag, but it seems to me to cover the characters a user might reasonably use as hex digits, without either trying to be super-tricky (and throwing weird numerals at us) or needing to change keyboard layouts (in order to get pure-ASCII from their number keys).

The big upside (and the reason that Unicode provides these properties, I believe) is that a user with a Japanese/Chinese/Korean keyboard setting can press the key marked "5" on their keyboard, and their software will recognize the fullwidth digit as the digit 5, even though it's not ASCII "5". It also means we only need to add the fullwidth forms of the 16 hex digits (22 code points, counting both cases of A-F), and don't need to do any decomposition.

The only downside I see is that this doesn't really scale beyond radix=16, but I think that allowing the Unicode Hex_Digit set up to hex, and then only ASCII for "g"/"G" through "z"/"Z" would be a fair compromise. I don't think I've ever actually seen a program that relied on parsing numbers of radix higher than 16, using the 0-9,A-Z set. (There's Base-64 encoding, but that uses a different ordering, and is case-sensitive, and adds other symbols at the end -- you can't use Common Lisp's numeric reading/printing support for that, anyway, no matter what we choose here.) Clearly it can't be that important to support Unicode decomposition out there, since SBCL has never supported non-ASCII letters for this.

I would be perfectly happy saying that G-Z radix support is ASCII-only, to meet the specification, and radix<=16 also works with the fullwidth forms of the 16 hex digits.
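
A minimal sketch of that compromise, for illustration only. It is not the patch attached to this bug, and the names ASCII-DIGIT-WEIGHT, FULLWIDTH-HEX-WEIGHT and PROPOSED-DIGIT-CHAR-P are hypothetical. It assumes character codes are Unicode code points (as in SBCL with :SB-UNICODE) and hard-codes the fullwidth ranges U+FF10-FF19, U+FF21-FF26 and U+FF41-FF46 instead of consulting the Hex_Digit property.

```lisp
;; Hypothetical sketch of the proposal; assumes CHAR-CODE yields
;; Unicode code points.
(defun ascii-digit-weight (char)
  "Weight 0-35 of an ASCII digit or letter, else NIL."
  (let ((code (char-code char)))
    (cond ((<= 48 code 57) (- code 48))          ; 0-9
          ((<= 65 code 90) (+ 10 (- code 65)))   ; A-Z
          ((<= 97 code 122) (+ 10 (- code 97)))  ; a-z
          (t nil))))

(defun fullwidth-hex-weight (char)
  "Weight 0-15 of a fullwidth hex digit, else NIL."
  (let ((code (char-code char)))
    (cond ((<= #xFF10 code #xFF19) (- code #xFF10))          ; fullwidth 0-9
          ((<= #xFF21 code #xFF26) (+ 10 (- code #xFF21)))   ; fullwidth A-F
          ((<= #xFF41 code #xFF46) (+ 10 (- code #xFF41)))   ; fullwidth a-f
          (t nil))))

(defun proposed-digit-char-p (char &optional (radix 10))
  "Like DIGIT-CHAR-P, but also accepts fullwidth hex digits.
Weights of 16 and above are only reachable through ASCII letters."
  (check-type radix (integer 2 36))
  (let ((weight (or (ascii-digit-weight char)
                    (fullwidth-hex-weight char))))
    (and weight (< weight radix) weight)))
```

With these definitions, (proposed-digit-char-p (code-char #xFF12)) returns 2 in radix 10 as well as radix 11, while FEMININE ORDINAL INDICATOR (U+00AA) is rejected in every radix because no decomposition is consulted.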

Ken Harris (kengruven+lp) wrote :

To keep the ball rolling on this, I wrote a sample (inefficient) implementation of my latter idea. No Unicode decomposition involved, and the ALPHANUMERICP invariant works. It follows the Decimal_Digit and Hex_Digit properties, plus adds ASCII letters G-Z for higher radices (which I consider a 'legacy' part of the CL spec, as I've never seen it used).

While writing this, I came up with another reason that DIGIT-CHAR-P should return true for non-ASCII digits: consistency with ALPHA-CHAR-P. Since ALPHA-CHAR-P returns true for (lots of) non-ASCII alphabetic characters, I would expect DIGIT-CHAR-P to do the same for non-ASCII digit characters.

Another possible choice, then, would be to make both ALPHA-CHAR-P and DIGIT-CHAR-P only return T for ASCII characters. It'd be internally consistent. I don't personally think that would be preferable to extending DIGIT-CHAR-P to other Unicode digits, but I would accept that that's one way to solve this problem -- especially if SBCL is going to be adding more powerful Unicode functionality. (Then we'd probably end up with a "trivial-unicode" package, to unify what all the different compilers do. Again, not my favorite solution, since I think we'd be throwing away the flexibility that the CL spec gave us here, and making people learn and use a completely new set of functions for Unicode-aware programs.)

My patch restores the CLHS ALPHANUMERICP invariant by fixing the code in DIGIT-CHAR-P.

I might go back and implement hex digit support at a later date, but this patch at least fixes standards-compliance.
