summary "ALPHANUMERIC/DIGIT-CHAR-P invariant broken with Unicode"
status triaged
importance medium
done
Ken Harris <email address hidden> writes:
> The Hyperspec page for ALPHANUMERICP even makes this relationship
> explicit:
>
> (alphanumericp x)
> == (or (alpha-char-p x) (not (null (digit-char-p x))))
I haven't thought this through properly, but I think that my preferred
resolution to this invariant breakage is actually to restrict
digit-char-p to the ASCII set, rather than extending it to fullwidth
digit variants and similar. The reason I say that is that if you expect
(digit-char-p #\FULLWIDTH_DIGIT_TWO 11) to be 2 (and I agree that that's
reasonable, if not the only possible thing) you might also expect
(digit-char-p #\FULLWIDTH_LATIN_CAPITAL_LETTER_A 11) to be 10, which is
perhaps a little more surprising but still not impossible, because we
could just take compatibility decompositions of characters, right?
Except that then (digit-char-p #\FEMININE_ORDINAL_INDICATOR 11) would
also be 10, which is frankly not expected at all.
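To make the restriction concrete, here is a minimal sketch of what an ASCII-only version could look like, together with a one-character check of the HyperSpec relationship quoted above. The names ascii-digit-char-p and alphanumericp-invariant-holds-p are hypothetical and exist neither in SBCL nor in the standard. The sketch also assumes that char-code returns ASCII/Unicode code points, as it does in SBCL; portable Common Lisp does not guarantee that.

```lisp
;; Hypothetical helper, not part of SBCL or the standard: a DIGIT-CHAR-P
;; variant that only ever recognizes ASCII 0-9, A-Z and a-z.
;; Assumes CHAR-CODE yields ASCII/Unicode code points (true in SBCL).
(defun ascii-digit-char-p (char &optional (radix 10))
  "Return CHAR's weight in RADIX if CHAR is an ASCII digit or letter
with weight below RADIX, otherwise NIL."
  (check-type radix (integer 2 36))
  (let* ((code (char-code char))
         (weight (cond ((<= 48 code 57) (- code 48))          ; 0-9
                       ((<= 65 code 90) (+ 10 (- code 65)))   ; A-Z
                       ((<= 97 code 122) (+ 10 (- code 97)))  ; a-z
                       (t nil))))
    (and weight (< weight radix) weight)))

;; Hypothetical checker: does the quoted ALPHANUMERICP relationship hold
;; for CHAR in the running implementation?  Both sides are normalized to
;; T or NIL before comparing.
(defun alphanumericp-invariant-holds-p (char)
  (eq (and (alphanumericp char) t)
      (and (or (alpha-char-p char)
               (not (null (digit-char-p char))))
           t)))
```

Under those assumptions, (ascii-digit-char-p #\2 11) returns 2 and (ascii-digit-char-p #\A 11) returns 10, while (ascii-digit-char-p (code-char #xFF12) 11) returns NIL for the fullwidth digit two. Mapping alphanumericp-invariant-holds-p over every code point below char-code-limit would list the characters for which a given implementation breaks the relationship.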
Of course, restricting digit-char-p to interpreting only ASCII digits as
numbers is irritating to those who want to work with Unicode. But I
think the answer to that is to provide and export richer Unicode
functionality, so that users can legitimately work with the Unicode data
that we store. (In my own slow way I am working on this; my GitHub fork
of SBCL has an update to Unicode 6.2 and the beginnings of
normalization, sadly not yet complete.)